Accept denylist as parquet
As we also saw in MR30 in section-topics, a fresh denylist run is much more extensive than the one currently used and bundled in this repo, because it uses the new SEAL alignment instead of the old one. The denylist itself seems alright, but it yields far fewer results (1291640 vs. 3181340 in the runs below). Perhaps we ought to discuss whether we want to keep all currently denylisted sections, or filter less aggressively.
Current master:
python image_suggestions/section_image_suggestions.py analytics_platform_eng 2023-11-20 /user/analytics-platform-eng/structured-data/section_topics/2023-11-20 /user/analytics-platform-eng/structured-data/section-alignment-suggestions/suggestions/2023-11-20 /user/analytics-platform-eng/structured-data/section-alignment-suggestions/article_images/2023-11-20 --output=/user/mlitn/image_suggestions/section_image_suggestions_old
spark.read.parquet('/user/mlitn/image_suggestions/section_image_suggestions_old').count() # 3181340
After this MR, with a fresh denylist:
python image_suggestions/section_image_suggestions.py analytics_platform_eng 2023-11-20 /user/analytics-platform-eng/structured-data/section_topics/2023-11-20 /user/analytics-platform-eng/structured-data/section-alignment-suggestions/suggestions/2023-11-20 /user/analytics-platform-eng/structured-data/section-alignment-suggestions/article_images/2023-11-20 --denylist=/user/mlitn/section_topics/section_titles_denylist --output=/user/mlitn/image_suggestions/section_image_suggestions
spark.read.parquet('/user/mlitn/image_suggestions/section_image_suggestions').count() # 1291640
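To put the two counts above side by side, here is a minimal sketch of the arithmetic. The variable names are illustrative only, and the numbers are copied verbatim from the two `count()` calls:

```python
# Row counts copied from the two runs above.
old_count = 3181340  # current master, bundled denylist
new_count = 1291640  # this MR, fresh denylist

# Absolute and relative drop in the number of suggestions.
dropped = old_count - new_count
dropped_pct = 100 * dropped / old_count

print(f"dropped {dropped} rows ({dropped_pct:.1f}% of the old output)")
```

That is a drop of roughly 59%, which is the figure to keep in mind when deciding whether to filter less aggressively.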
Bug: T339129