
Draft: Accept denylist as parquet

Matthias Mullie requested to merge T339129 into main

As we also saw in MR30 in section-topics, a fresh denylist run is much more extensive than the one currently bundled in this repo, because it uses the new SEAL alignment instead of the old one. The denylist itself seems alright, but it yields far fewer suggestions (compare the two counts below). Perhaps we ought to discuss whether we want to keep all currently denylisted sections, or filter less aggressively.

Current master:

python image_suggestions/section_image_suggestions.py analytics_platform_eng 2023-11-20 /user/analytics-platform-eng/structured-data/section_topics/2023-11-20 /user/analytics-platform-eng/structured-data/section-alignment-suggestions/suggestions/2023-11-20 /user/analytics-platform-eng/structured-data/section-alignment-suggestions/article_images/2023-11-20 --output=/user/mlitn/image_suggestions/section_image_suggestions_old

spark.read.parquet('/user/mlitn/image_suggestions/section_image_suggestions_old').count() # 3181340

After this MR, with a fresh denylist:

python image_suggestions/section_image_suggestions.py analytics_platform_eng 2023-11-20 /user/analytics-platform-eng/structured-data/section_topics/2023-11-20 /user/analytics-platform-eng/structured-data/section-alignment-suggestions/suggestions/2023-11-20 /user/analytics-platform-eng/structured-data/section-alignment-suggestions/article_images/2023-11-20 --denylist=/user/mlitn/section_topics/section_titles_denylist --output=/user/mlitn/image_suggestions/section_image_suggestions

spark.read.parquet('/user/mlitn/image_suggestions/section_image_suggestions').count() # 1291640
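To put the two counts side by side, here is a minimal sketch of the arithmetic. The variable names are illustrative only; the numbers are copied from the two `count()` calls above.

```python
# Row counts copied from the two spark.read.parquet(...).count() calls above.
old_count = 3181340  # current master, bundled denylist
new_count = 1291640  # this MR, fresh denylist

dropped = old_count - new_count
dropped_share = dropped / old_count

print(f"dropped: {dropped} rows ({dropped_share:.1%} of the old output)")
# prints: dropped: 1889700 rows (59.4% of the old output)
```

In other words, the fresh denylist removes roughly three out of every five suggestions that the bundled one lets through.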

Bug: T339129
