Add multi-datacenter Dataset and make DAGs robust against datacenter switchovers

Mforns requested to merge add-datacenter-datasets into main

This change adds a new class, HiveMultiDatacenterDataset, which helps generate sensor groups for source data that is distributed across multiple datacenters. When writing a DAG with a multi-datacenter data source, the developer should create the sensor as follows:

  1. Add the dataset snippet to airflow-dags//config/datasets.yaml
    hive_event_mediawiki_client_session_tick:
        datastore: hive_multi_datacenter
        table_name: event.mediawiki_client_session_tick
        partitioning: "@hourly"
        datacenters: ["eqiad", "codfw"]
  2. Create the sensor in the DAG using dataset().get_sensor_for() (see the dispatch sketch after these steps)
    from datetime import datetime
    from airflow import DAG
    from analytics.config.dag_config import dataset

    with DAG(dag_id="example_dag", start_date=datetime(2023, 1, 1)) as dag:  # placeholder dag args
        sensor = dataset("hive_event_mediawiki_client_session_tick").get_sensor_for(dag)
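
For context, here is a hypothetical sketch of how the dataset() factory might tie these two steps together, dispatching on the datastore field from datasets.yaml. The import path, file path, and constructor arguments below are assumptions for illustration, not the actual code in this MR:

    import yaml

    # Hypothetical import path; HiveMultiDatacenterDataset is the class
    # introduced by this MR.
    from analytics.config.datasets import HiveMultiDatacenterDataset

    def dataset(name: str):
        # Look up the entry added in step 1 (file path assumed).
        with open("config/datasets.yaml") as f:
            config = yaml.safe_load(f)[name]
        if config["datastore"] == "hive_multi_datacenter":
            return HiveMultiDatacenterDataset(
                table_name=config["table_name"],
                partitioning=config["partitioning"],
                datacenters=config["datacenters"],
            )
        raise ValueError(f"unsupported datastore: {config['datastore']}")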

get_sensor_for() returns a TaskGroup that can be used as if it were a sensor, for instance:

    sensor >> etl >> archive

The TaskGroup contains N sensors (where N is the number of configured datacenters) and 1 coordinator task. Each sensor pokes a given datacenter, and the coordinator task succeeds as soon as the first sensor succeeds. The coordinator also waits a couple of minutes, to allow all datacenters time to receive the data.
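
For illustration, here is a minimal sketch of how such a TaskGroup could be assembled from stock Airflow primitives: one NamedHivePartitionSensor per datacenter, plus a coordinator task with trigger_rule="one_success". The group id, partition layout, and the 120-second grace period are assumptions for the example, not the actual implementation of HiveMultiDatacenterDataset:

    import time

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.apache.hive.sensors.named_hive_partition import (
        NamedHivePartitionSensor,
    )
    from airflow.utils.task_group import TaskGroup

    def build_multi_datacenter_sensor(dag: DAG, datacenters=("eqiad", "codfw")) -> TaskGroup:
        with TaskGroup(group_id="wait_for_session_tick", dag=dag) as group:
            sensors = [
                NamedHivePartitionSensor(
                    task_id=f"wait_for_{dc}",
                    # Partition layout is assumed; the real spec would be derived
                    # from the dataset's `partitioning` config ("@hourly").
                    partition_names=[
                        "event.mediawiki_client_session_tick"
                        f"/datacenter={dc}"
                        "/year={{ execution_date.year }}"
                        "/month={{ execution_date.month }}"
                        "/day={{ execution_date.day }}"
                        "/hour={{ execution_date.hour }}"
                    ],
                )
                for dc in datacenters
            ]
            # one_success fires the coordinator as soon as the first sensor
            # succeeds; the sleep then gives the remaining datacenters a short
            # grace period to land their data as well.
            coordinator = PythonOperator(
                task_id="coordinator",
                python_callable=lambda: time.sleep(120),
                trigger_rule="one_success",
            )
            sensors >> coordinator
        return group

Since the coordinator is the group's only leaf task, chaining sensor >> etl makes etl depend on the coordinator alone, so data landing in a single datacenter is enough to unblock the pipeline.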
