First steps for the Airflow job
This MR sets up the initial Airflow job tasks.
- Task 0: an Airflow sensor that waits for the latest Wikidata snapshot/partition
- Task 1: a Spark job that gathers weighted tags for the Commons search index update
Specifically, I'd love to get advice on the snapshot handling. We agreed on the following steps:

- use the `{{ ds }}` Airflow template to pass the latest `wmf.wikidata_entity` snapshot/partition
- implement a `SensorTask` to wait for the given Hive partitions
- pass `{{ ds }}` to both tasks
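For concreteness, here is a minimal sketch of the partition-name plumbing, assuming `wmf.wikidata_entity` is partitioned by `snapshot=YYYY-MM-DD`; the helper name is hypothetical and only illustrates how the rendered `{{ ds }}` value would map to the partition the sensor waits on:

```python
# Sketch of how the rendered {{ ds }} value (the DAG run's logical date,
# e.g. "2024-01-01") could name the Hive partition the sensor waits for.
# Assumption: wmf.wikidata_entity is partitioned by snapshot=YYYY-MM-DD.

def wikidata_partition(ds: str) -> str:
    """Build the Hive partition name for a given {{ ds }} value."""
    return f"wmf.wikidata_entity/snapshot={ds}"

# A sensor such as Airflow's NamedHivePartitionSensor takes names in
# this "db.table/part=value" form:
print(wikidata_partition("2024-01-01"))
# wmf.wikidata_entity/snapshot=2024-01-01
```

The same rendered `ds` string could then be passed to the Spark task as a templated argument, so both tasks agree on the partition for a given DAG run.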
Besides the `SensorTask` implementation details, I'd like to confirm whether passing `{{ ds }}` will actually work for task 1.