
Programmatic declaration of Airflow Pools. Single-slot pool for MERGEs into wmf_dumps.wikitext_raw

Xcollazo requested to merge add-pool-def-to-dumps-dags into main

In this MR we do two things to introduce the use of Airflow pools.

We define a new mechanism, PoolRegistry, that borrows heavily from the existing DagProperties and DatasetRegistry mechanisms. This new mechanism:

  • Reads pool definitions from a list of YAML files.
  • Keeps these definitions in memory.
  • When a pool is needed, it creates that pool via the Airflow Pool API, but only if it does not already exist.
  • This mechanism never deletes or mutates existing pools. That is, if the YAML files and the Airflow DB disagree, the DB wins. (A rough sketch of this flow follows the list.)
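For illustration only, here is a minimal sketch of the PoolRegistry idea under stated assumptions: the YAML schema, method names, and overall structure are hypothetical and not the actual implementation in this MR; the Airflow calls used are Pool.get_pool and Pool.create_or_update_pool (newer Airflow versions may additionally require an include_deferred argument).

```python
# Minimal sketch only; the schema and helper names are assumptions, not this repo's code.
from dataclasses import dataclass
from typing import Dict, Sequence

import yaml
from airflow.models import Pool


@dataclass(frozen=True)
class PoolDefinition:
    name: str
    slots: int
    description: str = ""


class PoolRegistry:
    """Reads pool definitions from YAML files and creates missing pools lazily."""

    def __init__(self, yaml_paths: Sequence[str]):
        # Assumed YAML shape per file:
        #   pools:
        #     - name: merge_into_mutex_for_wmf_dumps_wikitext_raw
        #       slots: 1
        #       description: Serializes MERGE INTOs against wmf_dumps.wikitext_raw
        # Definitions are kept in memory, keyed by pool name.
        self._definitions: Dict[str, PoolDefinition] = {}
        for path in yaml_paths:
            with open(path) as f:
                for entry in (yaml.safe_load(f) or {}).get("pools", []):
                    definition = PoolDefinition(**entry)
                    self._definitions[definition.name] = definition

    def ensure_pool(self, name: str) -> str:
        """Create the pool via the Airflow Pool API only if it does not exist yet.

        Existing pools are never deleted or mutated: if the YAML files and the
        Airflow DB disagree, the DB wins.
        """
        definition = self._definitions[name]
        if Pool.get_pool(definition.name) is None:
            Pool.create_or_update_pool(
                name=definition.name,
                slots=definition.slots,
                description=definition.description,
            )
        return definition.name
```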

Additionally, in the YAML file for the analytics instance, we add a new pool called merge_into_mutex_for_wmf_dumps_wikitext_raw. This single-slot pool forces the Spark tasks of the multiple DAGs that MERGE INTO table wmf_dumps.wikitext_raw to run serially, effectively mimicking a mutex. This solves an issue where concurrent MERGEs over recent data sporadically fail with conflicts. We also bump the priority_weight of the MERGE tasks in DAG dumps_merge_backfill_to_wikitext_raw so that backfills are prioritized.
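As a hedged illustration of how a MERGE task would opt into the pool and the bumped priority, here is a sketch; the operator choice, application path, schedule, and the concrete priority_weight value are assumptions, not the actual task definitions in this repository.

```python
# Illustrative only; paths, schedule, and weights are assumptions.
import pendulum
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="dumps_merge_backfill_to_wikitext_raw",  # named in this MR; other details assumed
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
) as dag:
    merge_task = SparkSubmitOperator(
        task_id="merge_into_wikitext_raw",
        application="hdfs:///path/to/merge_into_wikitext_raw.py",  # hypothetical path
        # All MERGE tasks targeting wmf_dumps.wikitext_raw share this single-slot
        # pool, so at most one of them runs at a time (a de facto mutex).
        pool="merge_into_mutex_for_wmf_dumps_wikitext_raw",
        # Backfill MERGEs get a higher priority_weight so that, when several tasks
        # are queued for the pool's single slot, backfills are scheduled first.
        priority_weight=10,
    )
```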

Bug: T347718

