Skip Commons delta if its row count is above a given threshold
Given the discussion in https://phabricator.wikimedia.org/T372912#10336684, the first step to optimize the search index deltas is to skip Commons based on a row threshold.
Note that we're not setting a default threshold here, but will pass a value through the CLI from the DAG.
This MR includes some formatting, so the actual change lives in image_suggestions/shared.py and image_suggestions/search_indices.py.
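For reviewers who want the gist without reading the diff, here is a minimal sketch of the skip logic as described above. It is not the code in this MR: the flag name `--commons-delta-threshold`, the function names, and the strict "greater than" comparison are assumptions made for illustration.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the CLI. The flag name is hypothetical."""
    parser = argparse.ArgumentParser()
    # No default threshold: the caller (e.g. the DAG) must pass a value
    # explicitly for the skip to be active.
    parser.add_argument("--commons-delta-threshold", type=int, default=None)
    return parser.parse_args(argv)


def should_skip_commons(row_count: int, threshold: Optional[int]) -> bool:
    """Return True when the Commons delta should be skipped.

    Assumption: "above the threshold" means strictly greater than it,
    and a missing threshold means the delta is never skipped.
    """
    if threshold is None:
        return False
    return row_count > threshold
```

With the numbers from the test run below, `should_skip_commons(8046800, 1984)` returns `True`, which matches the dev table ending up with no commonswiki rows.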
Bug: T380389
Airflow test run
I've added "commons_delta_threshold": 1984 to Airflow's variables. Result:
prod = spark.read.table('analytics_platform_eng.image_suggestions_search_index_delta').where('snapshot="2024-12-02"')
dev = spark.read.table('isu_commons.image_suggestions_search_index_delta').where('snapshot="2024-12-02"')
prod.count(), dev.count()
(8277335, 594234)
prod.where('wikiid="commonswiki"').count(), dev.where('wikiid="commonswiki"').count()
(8046800, 0)