Filter bad pages by revision ID
This MR introduces a script that checks for badly parsed pages against the Action API and implements the corresponding filter in the pipeline.
- Closes https://phabricator.wikimedia.org/T323489
- optional CLI arg that takes an HDFS Parquet file with (wiki_db, revision_id) rows
- rows matching those (wiki_db, revision_id) pairs are filtered out of the initial dataframe of pages with wikitext
- the arg currently defaults to the badly parsed ptwiki pages, as output by scripts/check_bad_parsing.py
- tests & refactoring
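
For reviewers, the filtering semantics described above can be sketched in plain Python. This is only an illustration, not the code in this MR: the function name `filter_bad_revisions` and the dict-based row shape are hypothetical, and in the pipeline the same logic would presumably be expressed as an anti-join on the dataframe rather than a set lookup.

```python
from typing import Dict, Iterable, List

# Hypothetical row shape, assumed for illustration only:
# {"wiki_db": str, "revision_id": int, ...other columns...}
Row = Dict[str, object]


def filter_bad_revisions(pages: Iterable[Row], bad_revisions: Iterable[Row]) -> List[Row]:
    """Drop every page whose (wiki_db, revision_id) pair is listed as bad."""
    # Key on the pair, since a revision ID is only unique within one wiki.
    bad_keys = {(r["wiki_db"], r["revision_id"]) for r in bad_revisions}
    return [p for p in pages if (p["wiki_db"], p["revision_id"]) not in bad_keys]


# Made-up sample data, not real revision IDs.
pages = [
    {"wiki_db": "ptwiki", "revision_id": 101, "wikitext": "page a"},
    {"wiki_db": "ptwiki", "revision_id": 102, "wikitext": "page b"},
    {"wiki_db": "enwiki", "revision_id": 101, "wikitext": "page c"},
]
bad = [{"wiki_db": "ptwiki", "revision_id": 101}]

kept = filter_bad_revisions(pages, bad)
```

Matching on the (wiki_db, revision_id) pair rather than on revision_id alone keeps the enwiki row in this sample, even though it shares a revision ID with the bad ptwiki row.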