move Wikidata & lead images to a standalone job
- commons.py only handles Commons' search index
- remove duplicate logic
- remove single-use wrapper functions
- consolidate shared constants
- more concise symbols
- update docstrings
- update tests
Bug: T385862
Airflow test run
For snapshot 2024-12-23, prod has 10,848 more suggestions than dev in image_suggestions_suggestions, perhaps due to the non-deterministic nature of the pipeline.
tables = (
    'image_suggestions_wikidata_data', 'image_suggestions_lead_image_data',
    'image_suggestions_suggestions',
    'image_suggestions_search_index_full', 'image_suggestions_search_index_delta',
    'image_suggestions_instanceof_cache', 'image_suggestions_title_cache',
)
snapshot = '2024-12-23'
prod_db = 'analytics_platform_eng'
dev_db = 'liwd'
# Compare per-table row counts for the same snapshot (assumes an active `spark` session)
for t in tables:
    print(t)
    prod = spark.read.table(f'{prod_db}.{t}').where(f'snapshot="{snapshot}"')
    dev = spark.read.table(f'{dev_db}.{t}').where(f'snapshot="{snapshot}"')
    print('prod:', prod.count(), '- dev:', dev.count())
image_suggestions_wikidata_data
prod: 110915735 - dev: 110915735
image_suggestions_lead_image_data
prod: 8381808 - dev: 8381808
image_suggestions_suggestions
prod: 25664138 - dev: 25653290
image_suggestions_search_index_full
prod: 77978152 - dev: 77926499
image_suggestions_search_index_delta
prod: 199608 - dev: 203706
image_suggestions_instanceof_cache
prod: 4702646 - dev: 4698645
image_suggestions_title_cache
prod: 4655331 - dev: 4651345
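
To make the gaps easier to read, here is a minimal sketch in plain Python (no Spark needed) that computes prod minus dev for each table. The counts are copied from the run output above. The `counts` dict and the `count_diffs` helper are illustrative names and not part of the pipeline.

```python
# Row counts copied from the run output above, as (prod, dev) pairs.
counts = {
    'image_suggestions_wikidata_data': (110915735, 110915735),
    'image_suggestions_lead_image_data': (8381808, 8381808),
    'image_suggestions_suggestions': (25664138, 25653290),
    'image_suggestions_search_index_full': (77978152, 77926499),
    'image_suggestions_search_index_delta': (199608, 203706),
    'image_suggestions_instanceof_cache': (4702646, 4698645),
    'image_suggestions_title_cache': (4655331, 4651345),
}


def count_diffs(counts):
    """Return {table: prod - dev}; hypothetical helper, for illustration only."""
    return {table: prod - dev for table, (prod, dev) in counts.items()}


diffs = count_diffs(counts)
for table, diff in diffs.items():
    prod = counts[table][0]
    # Signed absolute difference, plus the difference relative to prod.
    print(f'{table}: {diff:+,} ({diff / prod:+.4%})')
```

The two input tables (Wikidata and lead image data) match exactly. The differences only appear in the downstream suggestions, search index, and cache tables.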