
Skip Commons delta if its row count is above a given threshold

Marco Fossati requested to merge T380389 into main

Given the discussion in https://phabricator.wikimedia.org/T372912#10336684, the first step to optimize the search index deltas is to skip Commons based on a row threshold.
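The threshold rule can be sketched in plain Python as follows. This is only an illustration under stated assumptions: should_skip_commons, filter_delta, and the dict-based row layout are hypothetical names, not the code in image_suggestions/search_indices.py, which works on Spark DataFrames. The sketch also assumes that a missing threshold means "never skip".

```python
from typing import Dict, Iterable, List, Optional

COMMONS_WIKI_ID = "commonswiki"


def should_skip_commons(commons_row_count: int, threshold: Optional[int]) -> bool:
    """Return True when the Commons delta has more rows than the threshold."""
    # Assumption: no threshold given means the Commons delta is never skipped.
    if threshold is None:
        return False
    # "Above a given threshold" is read here as a strict comparison.
    return commons_row_count > threshold


def filter_delta(rows: Iterable[Dict[str, str]], threshold: Optional[int]) -> List[Dict[str, str]]:
    """Drop all Commons rows from the delta if there are too many of them."""
    rows = list(rows)
    commons_count = sum(1 for row in rows if row["wikiid"] == COMMONS_WIKI_ID)
    if should_skip_commons(commons_count, threshold):
        return [row for row in rows if row["wikiid"] != COMMONS_WIKI_ID]
    return rows
```

In a Spark job the same decision would presumably be a count on the Commons subset followed by a filter on wikiid, but the exact calls used in this MR are not shown here.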

Note that this MR doesn't set a default threshold: the DAG will pass a value through the CLI.
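A minimal sketch of what a CLI option without a built-in threshold could look like, using argparse. The flag name --commons-delta-threshold and the parse_args helper are assumptions made for illustration; the actual option name and parser in this MR may differ.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the (hypothetical) threshold option passed by the DAG."""
    parser = argparse.ArgumentParser(description="Search index delta job (sketch)")
    parser.add_argument(
        "--commons-delta-threshold",  # hypothetical flag name
        type=int,
        default=None,  # no threshold baked into the code: the caller must supply one
        help="Skip the Commons delta if its row count is above this value",
    )
    return parser.parse_args(argv)
```

With this shape, the DAG would read the value from its own configuration and append it to the command line, for example --commons-delta-threshold 1984 as in the test run below.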

This MR also includes some formatting changes; the actual change lives in image_suggestions/shared.py and image_suggestions/search_indices.py.

Bug: T380389

Airflow test run

I've added "commons_delta_threshold": 1984 to Airflow's variables. Result:

prod = spark.read.table('analytics_platform_eng.image_suggestions_search_index_delta').where('snapshot="2024-12-02"')
dev = spark.read.table('isu_commons.image_suggestions_search_index_delta').where('snapshot="2024-12-02"')
prod.count(), dev.count()
(8277335, 594234)

prod.where('wikiid="commonswiki"').count(), dev.where('wikiid="commonswiki"').count()
(8046800, 0)
Edited by Marco Fossati
