Add monthly table maintenance to the "wikidata" Airflow instance.

The Wikidata Airflow instance populates an HDFS table called wikidata.wdqs_external_queries_by_user_agent_daily. This table is cleaned up at the row level every 90 days to comply with retention policies on user-agent strings. However, Iceberg keeps snapshots of earlier table states, and those snapshots can still reference the deleted rows, so we need to clean them up as well.
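
For context, snapshot cleanup in Iceberg is usually done with the expire_snapshots Spark procedure. The sketch below only builds such a statement; it is not what the shared maintenance DAGs necessarily run. The helper name, the spark_catalog catalog name, and the 90-day / retain_last values are assumptions for illustration, not the actual defaults.

```python
from datetime import datetime, timedelta


def build_expire_snapshots_sql(
    table: str,
    now: datetime,
    retention_days: int = 90,  # assumption: mirrors the row-level retention window
    retain_last: int = 1,  # assumption: always keep at least the latest snapshot
    catalog: str = "spark_catalog",  # assumption: the real catalog name may differ
) -> str:
    """Build a Spark SQL CALL that expires Iceberg snapshots older than the cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff:%Y-%m-%d %H:%M:%S}', "
        f"retain_last => {retain_last})"
    )


if __name__ == "__main__":
    print(
        build_expire_snapshots_sql(
            "wikidata.wdqs_external_queries_by_user_agent_daily",
            now=datetime(2025, 4, 1),
        )
    )
```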

This MR does a few things to make this happen:

  • Add the wikidata.wdqs_external_queries_by_user_agent_daily table to the global dataset registry in config/datasets.yaml
  • Add the "wikidata" instance to the instance_properties config file, which was required to populate our dataset into the registry (otherwise the registry-creation function could not see our instance)
  • Use the shared create_table_maintenance_iceberg_dags function to spawn DAGs that clean up the snapshots. The default settings are defined in wmf_airflow_common.dataset.IcebergDataset.DEFAULT_MAINTENANCE. We can override them in the datasets.yaml definition if we want to.
  • Regenerate the test fixtures by running make test.rebuild-fixtures in the repo.
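
As a rough illustration of the "defaults plus per-dataset overrides" idea in the third bullet, here is a minimal sketch. The keys and values are hypothetical placeholders, and resolve_maintenance is not a function in wmf_airflow_common; the real defaults live in IcebergDataset.DEFAULT_MAINTENANCE and may be structured differently.

```python
from typing import Any, Dict, Optional

# Hypothetical defaults, for illustration only. The real values are in
# wmf_airflow_common.dataset.IcebergDataset.DEFAULT_MAINTENANCE.
HYPOTHETICAL_DEFAULT_MAINTENANCE: Dict[str, Any] = {
    "schedule": "@monthly",
    "expire_snapshots_older_than_days": 90,
    "retain_last": 1,
}


def resolve_maintenance(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge per-dataset overrides on top of the defaults.

    Unknown keys are rejected so that a typo in the dataset definition
    fails loudly instead of being silently ignored.
    """
    overrides = overrides or {}
    unknown = set(overrides) - set(HYPOTHETICAL_DEFAULT_MAINTENANCE)
    if unknown:
        raise ValueError(f"Unknown maintenance settings: {sorted(unknown)}")
    # Later keys win, so overrides replace the matching defaults.
    return {**HYPOTHETICAL_DEFAULT_MAINTENANCE, **overrides}


if __name__ == "__main__":
    # A dataset that keeps the default schedule but retains more snapshots.
    print(resolve_maintenance({"retain_last": 5}))
```

With no overrides the dataset simply inherits every default, which is the case this MR relies on.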

Bug: T418723

cc: @trueg @gmodena @andrewtavis-wmde

Edited by Lerickson
