move Wikidata & lead images to a standalone job

Marco Fossati requested to merge T385862 into main
  • commons.py only handles Commons' search index
  • remove duplicate logic
  • remove single-use wrapper functions
  • consolidate shared constants
  • more concise symbols
  • update docstrings
  • update tests

Bug: T385862

Airflow test run

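Prod has 10,848 more suggestions than dev for the same snapshot (25,664,138 vs. 25,653,290 rows in image_suggestions_suggestions), perhaps due to the non-deterministic nature of the pipeline. The Wikidata and lead image tables match exactly; the search index and cache tables differ slightly.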

# Assumes an active SparkSession bound to `spark`, e.g. in a PySpark notebook.
tables = (
    'image_suggestions_wikidata_data', 'image_suggestions_lead_image_data',
    'image_suggestions_suggestions',
    'image_suggestions_search_index_full', 'image_suggestions_search_index_delta',
    'image_suggestions_instanceof_cache', 'image_suggestions_title_cache',
)
snapshot = '2024-12-23'
prod_db = 'analytics_platform_eng'
dev_db = 'liwd'
# Compare per-table row counts between prod and dev for the same snapshot.
for t in tables:
    print(t)
    prod = spark.read.table(f'{prod_db}.{t}').where(f'snapshot="{snapshot}"')
    dev = spark.read.table(f'{dev_db}.{t}').where(f'snapshot="{snapshot}"')
    print('prod:', prod.count(), '- dev:', dev.count())

image_suggestions_wikidata_data
prod: 110915735 - dev: 110915735
image_suggestions_lead_image_data
prod: 8381808 - dev: 8381808
image_suggestions_suggestions
prod: 25664138 - dev: 25653290
image_suggestions_search_index_full
prod: 77978152 - dev: 77926499
image_suggestions_search_index_delta
prod: 199608 - dev: 203706
image_suggestions_instanceof_cache
prod: 4702646 - dev: 4698645
image_suggestions_title_cache
prod: 4655331 - dev: 4651345
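
The per-table differences can be read off the counts above. A minimal sketch, assuming the printed counts are copied by hand into a dictionary; `count_deltas` is a hypothetical helper name used only for illustration and is not part of this merge request:

```python
# Row counts copied from the Airflow test run output above: table -> (prod, dev).
counts = {
    'image_suggestions_wikidata_data': (110915735, 110915735),
    'image_suggestions_lead_image_data': (8381808, 8381808),
    'image_suggestions_suggestions': (25664138, 25653290),
    'image_suggestions_search_index_full': (77978152, 77926499),
    'image_suggestions_search_index_delta': (199608, 203706),
    'image_suggestions_instanceof_cache': (4702646, 4698645),
    'image_suggestions_title_cache': (4655331, 4651345),
}


def count_deltas(counts):
    """Return dev minus prod row counts per table (hypothetical helper)."""
    return {table: dev - prod for table, (prod, dev) in counts.items()}


for table, delta in count_deltas(counts).items():
    prod, _ = counts[table]
    # Relative difference with respect to prod, as a percentage.
    print(f'{table}: {delta:+d} ({100 * delta / prod:+.3f}%)')
```

A negative value means dev has fewer rows than prod. Row counts alone cannot show whether matching tables hold identical rows; that would need a row-level comparison.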
Edited by Marco Fossati
