image suggestions data pipeline

Marco Fossati requested to merge T296814-image-suggestions into main

This MR introduces the image suggestions data pipeline, and closes https://phabricator.wikimedia.org/T296814

The Airflow DAG has the following tasks:

Issues to be solved before merge

Issue details

  • The commonswiki_file.py Airflow task fails
    • the container is killed for exceeding its memory limit
    • see an-airflow1003.eqiad.wmnet:/home/mfossati/commonswiki_file_failure.log
    • many delta parquet outputs are broken (i.e., the directory contains only a _temporary file)
    • lead_image_data_latest and wikidata_data_latest look fine
      • each has a _SUCCESS file plus snappy-compressed files
      • quickly checked with count() and show()
  • cassandra.py fails too
    • memory issues again
    • see an-airflow1003.eqiad.wmnet:/home/mfossati/cassandra_failure.log
    • only analytics_platform_eng.suggestions exists in Hive, and it is unclear whether this run wrote it
  • the latest Wikidata snapshot yields an empty commonswiki_file.py output
    • main suspect: Wikidata snapshots are weekly while wiki snapshots are monthly, e.g., at the beginning of the month Wikidata is at 2022-04-04, but the wikis may still be at 2022-03
    • perhaps there is no longer a reason to wait for the latest snapshot with the Hive sensor?
    • or perhaps use another sensor that waits for actual data to be available?
  • Hive connection fails
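
The broken-output and snapshot-mismatch observations above could be turned into two small checks. This is only a sketch under stated assumptions, not code from this MR: the function names are hypothetical, a finished write is assumed to leave a _SUCCESS marker plus at least one .parquet file, and wiki snapshots are assumed to be labelled YYYY-MM while Wikidata ones are labelled YYYY-MM-DD.

```python
from datetime import date
from typing import Iterable, Optional


def is_complete_output(entries: Iterable[str]) -> bool:
    """Return True if a directory listing looks like a finished write.

    Assumption: a finished write leaves a _SUCCESS marker and at least one
    .parquet file; a broken one leaves only a _temporary entry.
    """
    names = set(entries)
    has_marker = "_SUCCESS" in names
    has_parts = any(name.endswith(".parquet") for name in names)
    return has_marker and has_parts


def latest_monthly_snapshot(weekly: str, available: Iterable[str]) -> Optional[str]:
    """Pick the newest monthly snapshot (YYYY-MM) not later than the
    given weekly snapshot (YYYY-MM-DD), or None if there is none."""
    weekly_month = date.fromisoformat(weekly).strftime("%Y-%m")
    # YYYY-MM strings sort chronologically, so plain comparison is enough.
    candidates = [month for month in available if month <= weekly_month]
    return max(candidates, default=None)
```

The directory listing would have to come from HDFS in practice, which is left out here. With the example dates from the issue details, `latest_monthly_snapshot("2022-04-04", ["2022-02", "2022-03"])` falls back to the 2022-03 wiki snapshot instead of producing an empty join.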
Edited by Marco Fossati
