First iterations on topic relevance score and section exclusion
This MR introduces two features: a topic relevance score and section exclusion.
Relevance score details
- input = section topics dataframe
- output = same as input + raw TF, IDF numerator, IDF denominator, IDF, and TF-IDF columns
- workflow
  - filter null topic QIDs
  - compute cross-wiki raw TF: raw count of a topic QID in a page QID across wikis
  - join with input on page QID and topic QID
  - compute intra-wiki IDF numerator: count of page QIDs in a wiki
  - compute intra-wiki IDF denominator: count of page QIDs where a topic QID occurs, in a wiki
  - compute IDF: log(numerator / denominator)
  - join with input on wiki and topic QID
  - compute TF-IDF: raw TF * IDF
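The workflow above can be sketched in plain Python over a list of row dicts. This is only an illustration, not the code in this MR: the column names (`wiki_db`, `page_qid`, `topic_qid`, `tf`, `idf_num`, `idf_den`, `idf`, `tf_idf`) and the function name `add_relevance_scores` are hypothetical, and the sketch assumes the IDF counts are taken over distinct page QIDs after null topics are filtered out.

```python
import math
from collections import Counter, defaultdict


def add_relevance_scores(rows):
    """Return the non-null-topic rows with TF, IDF and TF-IDF columns added."""
    # Filter null topic QIDs.
    rows = [r for r in rows if r["topic_qid"] is not None]

    # Cross-wiki raw TF: occurrences of a topic QID in a page QID, all wikis pooled.
    tf = Counter((r["page_qid"], r["topic_qid"]) for r in rows)

    # Intra-wiki IDF inputs: distinct page QIDs per wiki (numerator) and
    # distinct page QIDs per wiki where a given topic QID occurs (denominator).
    pages_in_wiki = defaultdict(set)
    pages_with_topic = defaultdict(set)
    for r in rows:
        pages_in_wiki[r["wiki_db"]].add(r["page_qid"])
        pages_with_topic[(r["wiki_db"], r["topic_qid"])].add(r["page_qid"])

    scored = []
    for r in rows:
        raw_tf = tf[(r["page_qid"], r["topic_qid"])]
        idf_num = len(pages_in_wiki[r["wiki_db"]])
        idf_den = len(pages_with_topic[(r["wiki_db"], r["topic_qid"])])
        idf = math.log(idf_num / idf_den)
        scored.append({
            **r,
            "tf": raw_tf,
            "idf_num": idf_num,
            "idf_den": idf_den,
            "idf": idf,
            "tf_idf": raw_tf * idf,
        })
    return scored


# Made-up sample rows, one per (section, topic) occurrence.
rows = [
    {"wiki_db": "enwiki", "page_qid": "Q1", "topic_qid": "Q10"},
    {"wiki_db": "enwiki", "page_qid": "Q1", "topic_qid": "Q10"},
    {"wiki_db": "enwiki", "page_qid": "Q2", "topic_qid": "Q20"},
    {"wiki_db": "frwiki", "page_qid": "Q1", "topic_qid": "Q10"},
    {"wiki_db": "frwiki", "page_qid": "Q3", "topic_qid": None},
]
scored = add_relevance_scores(rows)
```

In this toy data, topic Q10 appears three times in page Q1 across wikis, so its raw TF is 3, while its IDF is computed separately for enwiki and frwiki.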
Section exclusion
We currently exclude External links, Further reading, and References in all Wikipedias covered by section alignment.
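As a rough illustration of the exclusion step, the snippet below drops the three English section titles named above. The names `EXCLUDED_SECTIONS` and `keep_section` are hypothetical, the case-insensitive and whitespace-tolerant matching is an assumption, and the localized titles used for other Wikipedias are not shown here.

```python
# English titles only; titles for other wikis are not part of this sketch.
EXCLUDED_SECTIONS = {"external links", "further reading", "references"}


def keep_section(title):
    """Return True if the section title is not in the exclusion set."""
    return title.strip().lower() not in EXCLUDED_SECTIONS


sections = ["History", "References", "External links ", "Further reading", "Geography"]
kept = [s for s in sections if keep_section(s)]
```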
@mnz: Diego has already given a thumbs up to the relevance score implementation; see https://phabricator.wikimedia.org/T314863#8299743.