Add multi-datacenter Dataset and make DAGs robust against datacenter switchovers
This change adds a new class HiveMultiDatacenterDataset, which helps generate sensor groups for source data distributed across multiple datacenters. When writing a DAG that has a multi-datacenter data source, the developer should create the sensor as follows:
- Add the dataset snippet to airflow-dags/analytics/config/datasets.yaml:
    hive_event_mediawiki_client_session_tick:
        datastore: hive_multi_datacenter
        table_name: event.mediawiki_client_session_tick
        partitioning: "@hourly"
        datacenters: ["eqiad", "codfw"]
- Create the sensor in the DAG using dataset().get_sensor_for():
    from airflow import DAG
    from analytics.config.dag_config import dataset

    with DAG(dag_id="example") as dag:  # dag_id is illustrative
        sensor = dataset("hive_event_mediawiki_client_session_tick").get_sensor_for(dag)
This will return a TaskGroup that can be used as if it were a sensor, for instance:
    sensor >> etl >> archive
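Here, etl and archive stand for arbitrary downstream tasks. A self-contained version of that snippet, using EmptyOperator (Airflow 2.3+) as a placeholder for them, could look like:

    from airflow.operators.empty import EmptyOperator

    etl = EmptyOperator(task_id="etl")
    archive = EmptyOperator(task_id="archive")

    # The TaskGroup chains like any single upstream task.
    sensor >> etl >> archive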
The TaskGroup contains N sensors (where N is the number of possible datacenters) and 1 coordinator task. Each sensor pokes a given datacenter, and the coordinator task succeeds as soon as the first sensor succeeds. The coordinator task also waits a couple of minutes, to allow all datacenters to receive the data.
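To make that structure concrete, here is a minimal sketch of how such a TaskGroup could be assembled. The function name, the choice of NamedHivePartitionSensor, the partition name template, and the 120-second grace period are all assumptions for illustration, not the actual HiveMultiDatacenterDataset code:

    # Hypothetical sketch, not the actual implementation. Assumes each
    # datacenter writes its own Hive partition (datacenter=<dc>) and that
    # a coordinator with trigger_rule "one_success" fires as soon as any
    # upstream sensor succeeds.
    import time

    from airflow.operators.python import PythonOperator
    from airflow.providers.apache.hive.sensors.named_hive_partition import (
        NamedHivePartitionSensor,
    )
    from airflow.utils.task_group import TaskGroup
    from airflow.utils.trigger_rule import TriggerRule

    def build_multi_datacenter_sensor(dag, table_name, datacenters, delay_seconds=120):
        """Return a TaskGroup with one sensor per datacenter plus a coordinator."""
        with TaskGroup(group_id="wait_for_" + table_name.replace(".", "_"), dag=dag) as group:
            coordinator = PythonOperator(
                task_id="coordinator",
                # Grace period so the remaining datacenters can catch up.
                python_callable=lambda: time.sleep(delay_seconds),
                trigger_rule=TriggerRule.ONE_SUCCESS,
                dag=dag,
            )
            for dc in datacenters:
                sensor = NamedHivePartitionSensor(
                    task_id=f"wait_for_{dc}",
                    # Partition layout is an assumption (hourly, unpadded values).
                    partition_names=[
                        f"{table_name}/datacenter={dc}"
                        "/year={{ execution_date.year }}"
                        "/month={{ execution_date.month }}"
                        "/day={{ execution_date.day }}"
                        "/hour={{ execution_date.hour }}"
                    ],
                    dag=dag,
                )
                sensor >> coordinator
        return group

The one_success trigger rule is what lets the coordinator fire as soon as any single sensor succeeds, rather than waiting for all of them, which is what keeps the DAG running through a datacenter switchover.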