First steps for the Airflow job

This MR sets up the initial Airflow job tasks.

  • task 0: an Airflow sensor that waits for the latest Wikidata snapshot/partition
  • task 1: a Spark job that gathers weighted tags for the Commons search index update

Specifically, I'd love to get advice on the snapshot handling. We agreed on the following steps:

  • use the Airflow {{ ds }} template variable to pass the latest wmf.wikidata_entity snapshot/partition
  • implement SensorTask to wait for the given Hive partitions
  • pass {{ ds }} to both tasks
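
As a rough illustration of the first step, here is how the {{ ds }} macro would render into the Hive partition name the sensor waits for. Airflow does this rendering via Jinja on templated fields; the snippet below mimics it with a plain substitution, and the `snapshot=` partition layout of `wmf.wikidata_entity` is an assumption based on this MR's text.

```python
# Sketch: mimic Airflow's Jinja rendering of the {{ ds }} macro
# (the DAG run's execution date, formatted YYYY-MM-DD) into the
# Hive partition spec a partition sensor would wait for.
# The table/partition layout below is assumed, not verified.

PARTITION_TEMPLATE = "wmf.wikidata_entity/snapshot={{ ds }}"

def render(template: str, ds: str) -> str:
    """Stand-in for Jinja: substitute the {{ ds }} macro."""
    return template.replace("{{ ds }}", ds)

partition = render(PARTITION_TEMPLATE, "2021-03-01")
print(partition)  # wmf.wikidata_entity/snapshot=2021-03-01
```

In a real DAG nothing renders the template by hand: the partition spec string containing {{ ds }} is passed to the sensor, and Airflow renders it before the sensor pokes Hive.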

Besides the SensorTask implementation details, I'd like to confirm whether passing {{ ds }} will actually work for task 1.
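
For the question about task 1: passing {{ ds }} should work as long as the operator declares the argument list among its templated fields, so Airflow renders the macro before the Spark job is submitted. A minimal sketch of that rendering step, with a hypothetical `--snapshot` flag (the real flag name depends on the Spark job's CLI):

```python
# Sketch: the Spark task would receive {{ ds }} inside its (templated)
# argument list; Airflow renders it at execution time. The --snapshot
# flag is hypothetical, used only for illustration.

def spark_args(ds_template: str) -> list:
    """Build the argument list as it appears in the DAG file, unrendered."""
    return ["--snapshot", ds_template]

# What Airflow's template rendering would produce for a given run date:
rendered = [a.replace("{{ ds }}", "2021-03-01") for a in spark_args("{{ ds }}")]
print(rendered)  # ['--snapshot', '2021-03-01']
```

If the operator in use does not template its argument field, the literal string `{{ ds }}` would reach the Spark job unrendered, which is the failure mode worth checking for.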

Edited by Marco Fossati