Draft: Consume parquets instead of static files

Matthias Mullie requested to merge T339129_2 into main

This MR switches to consuming parquets (which have been generated since MR29) instead of the static input files bundled within this repo.

In addition, the commit that updates the denylist ingestion also refactors away some code duplication for normalising section titles: now that the denylist has also become a parquet, we can drop the plain Python implementation and keep only the PySpark version. It also changes where denylisted rows are filtered out: this now happens near the end of main instead of inside extract_sections.

I also removed the work-dir argument; each remaining argument now expects a complete path. This makes things simpler when the data is not all consolidated in the same directory.
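For illustration, a minimal argparse sketch of what a CLI without work-dir could look like. This is not the actual parser from this repo: the option names are taken from the commands in this description, while build_parser, the name of the positional argument and the choice to make --output required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: every input and the output take a full path."""
    parser = argparse.ArgumentParser(prog="section_topics")
    # Assumption: the positional value (e.g. 2023-11-20) identifies the snapshot.
    parser.add_argument("snapshot", help="snapshot date, e.g. 2023-11-20")
    parser.add_argument("--page-filter", help="full path to the page filter parquet")
    parser.add_argument("--table-filter", help="full path to the table filter parquet")
    parser.add_argument(
        "--section-title-filter",
        help="full path to the section titles denylist parquet",
    )
    # May be given several times; each value is a full path to a QID parquet.
    parser.add_argument(
        "--qid-filter",
        action="append",
        default=[],
        help="full path to a QID filter parquet (repeatable)",
    )
    parser.add_argument("--output", required=True, help="full path of the output")
    return parser


args = build_parser().parse_args(
    [
        "2023-11-20",
        "--qid-filter=/user/mlitn/section_topics/qids_for_all_points_in_time",
        "--qid-filter=/user/mlitn/section_topics/qids_for_media_outlets",
        "--output=/user/mlitn/section_topics/2023-11-20",
    ]
)
print(args.qid_filter)
print(args.output)
```

With no work-dir to resolve relative names against, each path stands on its own, so inputs can live in different directories.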

I ran these scripts a couple of times:

Current (with bundled inputs):

$ python section_topics/ 2023-11-20 --work-dir=/user/mlitn/section_topics --page-filter=2022-10_ptwiki_bad --table-filter=20231120_target_wikis_tables'/user/mlitn/section_topics/2023-11-20').count() # 250121777

After 6cfec7d7 (qid-filter parquets):

$ python section_topics/ 2023-11-20 --work-dir=/user/mlitn/section_topics --page-filter=2022-10_ptwiki_bad --table-filter=20231120_target_wikis_tables --qid-filter=qids_for_all_points_in_time --qid-filter=qids_for_media_outlets'/user/mlitn/section_topics/2023-11-20').count() # 251654789

After d619d486 (qid-filter + section-title-filter parquets):

$ python section_topics/ 2023-11-20 --work-dir=/user/mlitn/section_topics --page-filter=2022-10_ptwiki_bad --table-filter=20231120_target_wikis_tables --qid-filter=qids_for_all_points_in_time --qid-filter=qids_for_media_outlets --section-title-filter=section_titles_denylist'/user/mlitn/section_topics/2023-11-20').count() # 143412657

Notice the significant drop in results (from 251654789 to 143412657, roughly 43% fewer)! This is caused by the much bigger denylist, which in turn is caused by it using the new SEAL alignment instead of the old one. AFAICT, the new alignment data (and, as a result, the new denylist) seems fine, and the newly denylisted sections are indeed the kind we want to exclude.
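To put the differences between runs in relative terms, here is a quick sketch using the counts reported above. The helper name relative_change is hypothetical and only serves this illustration.

```python
def relative_change(before: int, after: int) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100


# Counts copied from the runs listed in this description.
bundled_inputs = 250121777
qid_parquets = 251654789
qid_and_denylist_parquets = 143412657

qid_change = relative_change(bundled_inputs, qid_parquets)
denylist_change = relative_change(qid_parquets, qid_and_denylist_parquets)

print(f"qid-filter parquets: {qid_change:+.2f}%")
print(f"section-title-filter parquet: {denylist_change:+.2f}%")
```

The first step changes the count by well under 1%, while the switch to the denylist parquet accounts for essentially all of the drop.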

Final T339129_2 (with changes to path):

$ python section_topics/ 2023-11-20 --page-filter=/user/mlitn/section_topics/2022-10_ptwiki_bad --table-filter=/user/mlitn/section_topics/20231120_target_wikis_tables --section-title-filter=/user/mlitn/section_topics/section_titles_denylist --qid-filter=/user/mlitn/section_topics/qids_for_all_points_in_time --qid-filter=/user/mlitn/section_topics/qids_for_media_outlets --output=/user/mlitn/section_topics/2023-11-20'/user/mlitn/section_topics/2023-11-20').count() # 143412657

Bug: T339129

Edited by Matthias Mullie

