image suggestions data pipeline
This MR introduces the image suggestions data pipeline, and closes https://phabricator.wikimedia.org/T296814
The Airflow DAG has the following tasks:
- wait for the latest snapshot date of relevant tables with a Hive sensor
- Wikidata weighted tags for the Commons search index (https://phabricator.wikimedia.org/T302095)
- all-Wikis image suggestions for Cassandra (https://phabricator.wikimedia.org/T299789)
- suggestion flags for Wikis search indices (https://phabricator.wikimedia.org/T299884)
- clean up HDFS
  - `suggestions` Cassandra table (https://phabricator.wikimedia.org/T293808)
  - `title_cache` Cassandra table (https://phabricator.wikimedia.org/T293808)
  - `instanceof_cache` Cassandra table (https://phabricator.wikimedia.org/T293808)
Issues to be solved before merge
- merge !50 (merged)
- broken outputs: containers get killed due to memory errors, see https://phabricator.wikimedia.org/T307362. Fix at !57 (diffs)
- the latest Wikidata snapshot yields empty `commonswiki_file.py` output, see https://phabricator.wikimedia.org/T307371. Fix at !55 (merged)
- Hive connection within Airflow fails. See !55 (comment 6700)
Issue details
- The `commonswiki_file.py` Airflow task fails
  - the container gets killed for exceeding memory limits
  - see `an-airflow1003.eqiad.wmnet:/home/mfossati/commonswiki_file_failure.log`
  - there are a lot of broken delta parquets (i.e., a directory with a single `_temporary` file)
  - `lead_image_data_latest` and `wikidata_data_latest` look fine
    - `_SUCCESS` file + snappy ones
    - quickly checked with `count()` & `show()`
- `cassandra.py` fails too
  - memory issues again
  - see `an-airflow1003.eqiad.wmnet:/home/mfossati/cassandra_failure.log`
  - only `analytics_platform_eng.suggestions` is there in Hive, and it is unclear whether it was written after the run
- the latest Wikidata snapshot yields empty `commonswiki_file.py` output
  - the main suspect is weekly Wikidata snapshots vs. monthly Wikis ones, e.g., at the beginning of the month Wikidata is on 2022-04-04, but Wikis may still be on 2022-03
  - maybe there is no more reason to wait for the latest snapshot with the Hive sensor?
  - or maybe use another sensor that waits for actual data to be available?
- Hive connection fails
  - http://localhost:8600/log?dag_id=image-suggestions&task_id=wait_for_hive_partitions&execution_date=2022-04-27T15%3A05%3A10.184768%2B00%3A00
  - added `metastore_default` connection in the Admin panel of the Airflow Web UI
  - maybe try with this URI: thrift://analytics-hive.eqiad.wmnet:9083 ?
  - passed `metastore_conn_id='analytics-hive'` to the Hive sensor constructor
  - currently hitting !55 (comment 6700)
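
The broken vs. healthy outputs noted above can be told apart by their marker files: a finished Spark write leaves a `_SUCCESS` file, while an interrupted one leaves only the `_temporary` staging entry. A minimal sketch of that check follows. The function name `classify_output` is hypothetical and not part of this MR, and it works on a plain list of entry names, so the listing itself (e.g., from HDFS) is assumed to be obtained elsewhere.

```python
from typing import Iterable


def classify_output(entries: Iterable[str]) -> str:
    """Classify an output directory from the names of its entries.

    Hypothetical helper: `entries` is assumed to be the listing of one
    output directory, obtained by whatever HDFS tooling is at hand.
    """
    names = set(entries)
    if "_SUCCESS" in names:
        # The job committed its output.
        return "complete"
    if "_temporary" in names:
        # Only staging data is left: the job died before committing.
        return "broken"
    # Neither marker is present: nothing can be concluded.
    return "unknown"


if __name__ == "__main__":
    print(classify_output(["_temporary"]))
    print(classify_output(["_SUCCESS", "part-00000.snappy.parquet"]))
```

This only flags directories that never committed. It says nothing about whether a committed output is empty, so it complements the `count()` & `show()` spot checks and does not replace them.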
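
The suspected weekly vs. monthly snapshot mismatch can be made concrete with a small date calculation. The sketch below rests on two assumptions that this MR does not confirm: monthly Wikis snapshots are labelled `YYYY-MM`, and a month's snapshot can only exist once that month is over. The function name `latest_complete_month` is hypothetical.

```python
from datetime import date, timedelta


def latest_complete_month(weekly_snapshot: str) -> str:
    """Return the last fully elapsed month before a weekly snapshot date.

    Assumption: monthly snapshots are labelled 'YYYY-MM' and cannot
    exist before their month has ended.
    """
    day = date.fromisoformat(weekly_snapshot)
    # Step back from the first day of the current month into the previous one.
    last_day_of_previous_month = day.replace(day=1) - timedelta(days=1)
    return last_day_of_previous_month.strftime("%Y-%m")


if __name__ == "__main__":
    # Weekly Wikidata snapshot from the example above.
    print(latest_complete_month("2022-04-04"))  # 2022-03
```

Under these assumptions, a Wikidata snapshot dated 2022-04-04 can at best be paired with the 2022-03 Wikis snapshot, and only if that one has already been published. A join that expects both sides to carry the same period would then come back empty, which matches the symptom.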