Draft: Consume parquets instead of static files
This changes pipeline.py to consume parquets (which have been generated since MR29) instead of the static input files bundled in this repo.
In addition, the commit that updates the denylist ingestion includes some refactoring to remove duplicated code for normalising section titles: now that the denylist has also become a parquet, we can drop the plain Python implementation and keep only the PySpark version.
This also changes where denylisted rows are filtered out: it now happens near the end of main instead of being bundled within extract_sections.
I also removed the --work-dir argument; each argument now expects a more complete path. This makes things simpler when the data is not all consolidated in the same directory.
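As a rough illustration of the resulting command-line shape (not the actual parser in pipeline.py), here is a minimal argparse sketch. The flag names are taken from the commands in this description; the positional name `snapshot`, the `build_parser` helper, and the defaults are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Sketch only: every filter takes its own full path, so there is no
    # shared --work-dir to resolve names against.
    parser = argparse.ArgumentParser(prog="pipeline.py")
    parser.add_argument("snapshot", help="date of the run, e.g. 2023-11-20 (assumed name)")
    parser.add_argument("--page-filter", help="full path to the page filter parquet")
    parser.add_argument("--table-filter", help="full path to the table filter parquet")
    parser.add_argument("--section-title-filter", help="full path to the denylist parquet")
    # --qid-filter may be given several times; each value is collected in a list.
    parser.add_argument("--qid-filter", action="append", default=[],
                        help="full path to a QID filter parquet (repeatable)")
    parser.add_argument("--output", help="full path of the output parquet")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this shape, inputs living in different directories need no special handling, since nothing is joined onto a common base path.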
I ran these scripts a couple of times:
Current (with bundled inputs):
$ python section_topics/pipeline.py 2023-11-20 --work-dir=/user/mlitn/section_topics --page-filter=2022-10_ptwiki_bad --table-filter=20231120_target_wikis_tables
spark.read.parquet('/user/mlitn/section_topics/2023-11-20').count() # 250121777
After 6cfec7d7 (qid-filter parquets):
$ python section_topics/pipeline.py 2023-11-20 --work-dir=/user/mlitn/section_topics --page-filter=2022-10_ptwiki_bad --table-filter=20231120_target_wikis_tables --qid-filter=qids_for_all_points_in_time --qid-filter=qids_for_media_outlets
spark.read.parquet('/user/mlitn/section_topics/2023-11-20').count() # 251654789
After d619d486 (qid-filter + section-title-filter parquets):
$ python section_topics/pipeline.py 2023-11-20 --work-dir=/user/mlitn/section_topics --page-filter=2022-10_ptwiki_bad --table-filter=20231120_target_wikis_tables --qid-filter=qids_for_all_points_in_time --qid-filter=qids_for_media_outlets --section-title-filter=section_titles_denylist
spark.read.parquet('/user/mlitn/section_topics/2023-11-20').count() # 143412657
Notice the significant drop in results! This is caused by the much bigger denylist, which in turn comes from using the new SEAL alignment instead of the old one. AFAICT, the new alignment data (and, as a result, the new denylist) seems fine, and the newly excluded sections are indeed the kind we want to exclude.
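To put a number on that drop, the two row counts reported above can be compared directly. This is plain arithmetic on the counts from this description; the variable names are only illustrative.

```python
# Row counts copied from the runs above.
before = 251_654_789  # after 6cfec7d7 (qid-filter parquets)
after = 143_412_657   # after d619d486 (qid-filter + section-title-filter parquets)

dropped = before - after
share = dropped / before

print(f"{dropped:,} rows dropped ({share:.1%} of the previous output)")
# prints "108,242,132 rows dropped (43.0% of the previous output)"
```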
Final T339129_2 (with changes to path):
$ python section_topics/pipeline.py 2023-11-20 --page-filter=/user/mlitn/section_topics/2022-10_ptwiki_bad --table-filter=/user/mlitn/section_topics/20231120_target_wikis_tables --section-title-filter=/user/mlitn/section_topics/section_titles_denylist --qid-filter=/user/mlitn/section_topics/qids_for_all_points_in_time --qid-filter=/user/mlitn/section_topics/qids_for_media_outlets --output=/user/mlitn/section_topics/2023-11-20
spark.read.parquet('/user/mlitn/section_topics/2023-11-20').count() # 143412657
Bug: T339129