Add multi-datacenter Dataset and make DAGs robust against datacenter switchovers
This change adds a new class HiveMultiDatacenterDataset, which helps generate sensor groups for source data distributed across multiple datacenters. When writing a DAG that has a multi-datacenter data source, the developer should create the sensor as follows:
- Add the dataset snippet to airflow-dags/analytics/config/datasets.yaml:
    hive_event_mediawiki_client_session_tick:
        datastore: hive_multi_datacenter
        table_name: event.mediawiki_client_session_tick
        partitioning: "@hourly"
        datacenters: ["eqiad", "codfw"]
- Create the sensor in the DAG using dataset().get_sensor_for():
    from airflow import DAG
    from analytics.config.dag_config import dataset

    with DAG(dag_id="example") as dag:  # dag_id is illustrative
        sensor = dataset("hive_event_mediawiki_client_session_tick").get_sensor_for(dag)
This will return a TaskGroup that can be used as if it were a sensor, for instance:
    sensor >> etl >> archive
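Here, etl and archive stand for arbitrary downstream tasks. A self-contained version of that snippet, using EmptyOperator (Airflow 2.3+) as a placeholder for them, could look like:

    from airflow.operators.empty import EmptyOperator

    etl = EmptyOperator(task_id="etl")
    archive = EmptyOperator(task_id="archive")

    # The TaskGroup chains like any single upstream task.
    sensor >> etl >> archive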
The TaskGroup contains N sensors (where N is the number of possible datacenters) and 1 coordinator task. Each sensor pokes a given datacenter, and the coordinator task succeeds as soon as the first sensor succeeds. The coordinator task also waits a couple of minutes, to allow all datacenters to receive the data.
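To make that structure concrete, here is a minimal sketch of how such a TaskGroup could be assembled. The function name, the choice of NamedHivePartitionSensor, the partition name template, and the 120-second grace period are all assumptions for illustration, not the actual HiveMultiDatacenterDataset code:

    # Hypothetical sketch, not the actual implementation. Assumes each
    # datacenter writes its own Hive partition (datacenter=<dc>) and that
    # a coordinator with trigger_rule "one_success" fires as soon as any
    # upstream sensor succeeds.
    import time

    from airflow.operators.python import PythonOperator
    from airflow.providers.apache.hive.sensors.named_hive_partition import (
        NamedHivePartitionSensor,
    )
    from airflow.utils.task_group import TaskGroup
    from airflow.utils.trigger_rule import TriggerRule

    def build_multi_datacenter_sensor(dag, table_name, datacenters, delay_seconds=120):
        """Return a TaskGroup with one sensor per datacenter plus a coordinator."""
        with TaskGroup(group_id="wait_for_" + table_name.replace(".", "_"), dag=dag) as group:
            coordinator = PythonOperator(
                task_id="coordinator",
                # Grace period so the remaining datacenters can catch up.
                python_callable=lambda: time.sleep(delay_seconds),
                trigger_rule=TriggerRule.ONE_SUCCESS,
                dag=dag,
            )
            for dc in datacenters:
                sensor = NamedHivePartitionSensor(
                    task_id=f"wait_for_{dc}",
                    # Partition layout is an assumption (hourly, unpadded values).
                    partition_names=[
                        f"{table_name}/datacenter={dc}"
                        "/year={{ execution_date.year }}"
                        "/month={{ execution_date.month }}"
                        "/day={{ execution_date.day }}"
                        "/hour={{ execution_date.hour }}"
                    ],
                    dag=dag,
                )
                sensor >> coordinator
        return group

The one_success trigger rule is what lets the coordinator fire as soon as any single sensor succeeds, rather than waiting for all of them, which is what keeps the DAG running through a datacenter switchover.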