Filter bad pages by revision ID

Marco Fossati requested to merge dev into main

This MR introduces a script that checks for badly parsed pages against the Action API and implements the corresponding filter in the pipeline.

  • Closes https://phabricator.wikimedia.org/T323489
  • adds an optional CLI argument that takes an HDFS Parquet file with wiki_db, revision_id rows
  • filters the matching rows out of the initial dataframe of pages with wikitext
  • the default input is currently the set of badly parsed ptwiki pages, as output by scripts/check_bad_parsing.py
  • tests & refactoring
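
The filtering step described above amounts to an anti-join on the (wiki_db, revision_id) pair. The following is a minimal plain-Python sketch of that logic, not the code in this MR: `PageRow`, `filter_bad_revisions`, and the sample values are hypothetical names and data chosen for illustration, and the real pipeline operates on a dataframe rather than a list.

```python
from typing import Iterable, List, NamedTuple, Tuple


class PageRow(NamedTuple):
    """Hypothetical stand-in for one row of the pages-with-wikitext dataframe."""
    wiki_db: str
    revision_id: int
    wikitext: str


def filter_bad_revisions(
    pages: Iterable[PageRow],
    bad_revisions: Iterable[Tuple[str, int]],
) -> List[PageRow]:
    """Drop every page whose (wiki_db, revision_id) appears in bad_revisions."""
    # A set gives constant-time membership checks on the composite key.
    bad_keys = set(bad_revisions)
    return [p for p in pages if (p.wiki_db, p.revision_id) not in bad_keys]


# Illustrative data only: revision IDs and wikitext are made up.
pages = [
    PageRow("ptwiki", 101, "text a"),
    PageRow("ptwiki", 102, "text b"),
    PageRow("enwiki", 101, "text c"),
]
bad = [("ptwiki", 101)]

kept = filter_bad_revisions(pages, bad)
```

Matching on both columns matters because revision IDs are only unique within a wiki: in the sample data, the enwiki row with revision 101 is kept even though ptwiki revision 101 is filtered out. On a Spark dataframe the same effect could be obtained with a left anti join on the two columns.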
