
First iterations on topic relevance score and section exclusion

Marco Fossati requested to merge dev into main

This MR introduces two features:

  1. article-level relevance score, closes T314863
  2. section exclusion, closes T318092

Relevance score details

  • input = section topics dataframe
  • output = same as input + raw TF, IDF numerator, IDF denominator, IDF, and TF-IDF columns
  • workflow
    1. filter null topic QIDs
    2. compute cross-wiki raw TF: number of occurrences of a topic QID in a page QID, counted across all wikis
    3. join with input on page QID and topic QID
    4. compute intra-wiki IDF numerator: count of page QIDs in a wiki
    5. compute intra-wiki IDF denominator: count of page QIDs where a topic QID occurs, in a wiki
    6. compute IDF: log(numerator / denominator)
    7. join with input on wiki and topic QID
    8. compute TF-IDF: raw TF * IDF
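
The workflow above can be sketched in plain Python as follows. This is a minimal illustration, not the code in this MR: the row keys (`wiki`, `page_qid`, `topic_qid`), the output column names, the function name `add_tf_idf`, and the use of the natural logarithm are all assumptions, and the joins are replaced by dictionary lookups. It also assumes the IDF counts are taken after null topic QIDs have been filtered out.

```python
import math
from collections import Counter, defaultdict


def add_tf_idf(rows):
    """Return the input rows plus TF, IDF, and TF-IDF columns (hypothetical names)."""
    # Step 1: filter null topic QIDs
    rows = [r for r in rows if r["topic_qid"] is not None]

    # Step 2: cross-wiki raw TF, i.e. occurrences of a topic QID in a page QID
    raw_tf = Counter((r["page_qid"], r["topic_qid"]) for r in rows)

    # Steps 4 and 5: distinct page QIDs per wiki, and per (wiki, topic QID)
    pages_per_wiki = defaultdict(set)
    pages_per_wiki_topic = defaultdict(set)
    for r in rows:
        pages_per_wiki[r["wiki"]].add(r["page_qid"])
        pages_per_wiki_topic[(r["wiki"], r["topic_qid"])].add(r["page_qid"])

    scored = []
    for r in rows:
        # Steps 3 and 7: the joins become lookups on the grouping keys
        tf = raw_tf[(r["page_qid"], r["topic_qid"])]
        numerator = len(pages_per_wiki[r["wiki"]])
        denominator = len(pages_per_wiki_topic[(r["wiki"], r["topic_qid"])])
        # Step 6: IDF (natural log is an assumption; the MR only says "log")
        idf = math.log(numerator / denominator)
        # Step 8: TF-IDF
        scored.append({
            **r,
            "tf": tf,
            "idf_numerator": numerator,
            "idf_denominator": denominator,
            "idf": idf,
            "tf_idf": tf * idf,
        })
    return scored


# Made-up example rows, one per topic occurrence in a section
rows = [
    {"wiki": "enwiki", "page_qid": "Q1", "topic_qid": "Q10"},
    {"wiki": "enwiki", "page_qid": "Q1", "topic_qid": "Q10"},
    {"wiki": "itwiki", "page_qid": "Q1", "topic_qid": "Q10"},
    {"wiki": "enwiki", "page_qid": "Q2", "topic_qid": "Q20"},
    {"wiki": "enwiki", "page_qid": "Q2", "topic_qid": None},
]
scored = add_tf_idf(rows)
```

In this toy input, the topic Q10 gets a raw TF of 3 for page Q1 because occurrences are counted across wikis, while its IDF differs per wiki because the page counts are taken within each wiki.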

Section exclusion

We currently exclude the External links, Further reading, and References sections in all Wikipedias covered by section alignment.
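
A rough sketch of what such a filter could look like is shown below. The mapping `EXCLUDED_SECTIONS`, the function `exclude_sections`, the row keys, and the exact-match comparison are hypothetical. Only the English titles named above are listed, since the localized titles for the other Wikipedias are not given here.

```python
# Hypothetical per-wiki denylist; only the English titles come from this MR
EXCLUDED_SECTIONS = {
    "enwiki": {"External links", "Further reading", "References"},
}


def exclude_sections(rows, excluded=EXCLUDED_SECTIONS):
    """Drop rows whose section title is on the denylist of their wiki."""
    return [
        r for r in rows
        if r["section_title"] not in excluded.get(r["wiki"], set())
    ]


# Made-up example rows
sections = [
    {"wiki": "enwiki", "section_title": "History"},
    {"wiki": "enwiki", "section_title": "References"},
    {"wiki": "enwiki", "section_title": "External links"},
]
kept = exclude_sections(sections)
kept_titles = [r["section_title"] for r in kept]
```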

@mnz: Diego has already given a thumbs up to the relevance score implementation, see https://phabricator.wikimedia.org/T314863#8299743.
