Migrate Refine from systemd to Airflow

Aqu requested to merge T356762_iceberg_table_management into main

This merge request adds a staging DAG to Airflow Analytics. The DAG will be rolled out progressively:

  1. Feed the DAG a mocked version of ESC containing a sample of ~170 datasets, targeting a staging database for refined data.
  2. Run the DAG in parallel with the existing Refine process on systemd and use an ad-hoc script to check for differences.
  3. Gradually increase the sample size to assess the effect of the load on the current Airflow setup.
  4. For deployment, switch production output from the systemd Refine to the new Airflow Refine, so that the new job writes to the event DB while the discrepancy checks continue.
  5. Finally, remove the diffing process and deactivate the legacy Refine.
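
The ad-hoc diff script from step 2 is not part of this description. As a rough illustration of the kind of check it could perform, here is a minimal sketch that compares per-partition row counts from the two runs. The function name, the partition-key format, and the use of row counts as the comparison signal are all assumptions, not the actual script.

```python
from typing import Dict, List, NamedTuple


class PartitionDiff(NamedTuple):
    missing: List[str]      # partitions only the legacy (systemd) run produced
    unexpected: List[str]   # partitions only the Airflow run produced
    mismatched: List[str]   # partitions present in both, with different row counts


def diff_partitions(legacy: Dict[str, int], candidate: Dict[str, int]) -> PartitionDiff:
    """Compare two {partition_key: row_count} mappings (hypothetical input shape)."""
    legacy_keys = set(legacy)
    candidate_keys = set(candidate)
    missing = sorted(legacy_keys - candidate_keys)
    unexpected = sorted(candidate_keys - legacy_keys)
    mismatched = sorted(
        key for key in legacy_keys & candidate_keys if legacy[key] != candidate[key]
    )
    return PartitionDiff(missing, unexpected, mismatched)
```

A check like this only detects missing partitions and count drift. Comparing the refined content itself would need something stronger, such as per-column checksums.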

DAG Details:

  • Loads the configuration from ESC and creates one task group per enabled dataset.
  • Updates the table schema to reflect the latest JSON schema version.
  • Refines the data and creates a new Hive partition.
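
The structure described above can be sketched in plain Python without depending on Airflow. This is a minimal illustration and not the DAG code in this branch: the config shape (an `enabled` flag per dataset), the `refine_` prefix, and the step names are hypothetical.

```python
from typing import Dict, List

# Hypothetical step names mirroring the two per-dataset actions listed above.
STEPS = ["evolve_table_schema", "refine_to_hive_partition"]


def enabled_datasets(stream_config: Dict[str, dict]) -> List[str]:
    """Return the names of datasets whose settings mark them as enabled."""
    return sorted(
        name for name, settings in stream_config.items()
        if settings.get("enabled", False)
    )


def plan_task_groups(stream_config: Dict[str, dict]) -> Dict[str, List[str]]:
    """Map one group id per enabled dataset to its ordered list of steps."""
    return {f"refine_{name}": list(STEPS) for name in enabled_datasets(stream_config)}
```

In an actual DAG file, each entry of the plan would presumably become an Airflow `TaskGroup` containing two chained tasks, with the schema update running before the refine step.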

This branch is currently running on the test cluster.

Bug: T356762

Edited by Aqu
