Add monthly table maintenance to the "wikidata" Airflow instance.
The Wikidata Airflow instance populates an HDFS table called `wikidata.wdqs_external_queries_by_user_agent_daily`. This table is cleaned up at the row level every 90 days to comply with retention policies on user-agent strings. However, Iceberg keeps snapshots of the table that can still reference the deleted rows, so we also need to clean those up.
This MR does a few things to make this happen:
- Add the `wikidata.wdqs_external_queries_by_user_agent_daily` table to the global dataset registry in config/datasets.yaml
- Add the "wikidata" instance to the instance_properties config file, which was required to populate our dataset into the registry (otherwise the registry-creation function could not see our instance)
- Use the shared create_table_maintenance_iceberg_dags function to spawn DAGs to clean up the snapshots. The default settings are saved in `wmf_airflow_common.dataset.IcebergDataset.DEFAULT_MAINTENANCE`; see here. We can override the defaults in the datasets.yaml definition if we want to.
- Also, the test fixtures were autogenerated by `make test.rebuild-fixtures` in the repo.
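
For reviewers unfamiliar with the defaults-plus-override pattern mentioned above, here is a minimal Python sketch of how per-dataset settings could be layered on top of shared defaults. It is an illustration only: the keys, values, and the `resolve_maintenance` helper are hypothetical, and the real structure of `DEFAULT_MAINTENANCE` and the actual merge logic in `wmf_airflow_common` may differ.

```python
from copy import deepcopy

# Hypothetical defaults for illustration only. The real values live in
# wmf_airflow_common.dataset.IcebergDataset.DEFAULT_MAINTENANCE and may
# use different keys and structure.
DEFAULT_MAINTENANCE = {
    "schedule": "@monthly",
    "expire_snapshots": {"enabled": True, "older_than_days": 90},
}


def resolve_maintenance(overrides=None):
    """Return a copy of the defaults with per-dataset overrides applied.

    Nested dicts are merged one level deep; any other value replaces
    the default outright. This helper is hypothetical.
    """
    resolved = deepcopy(DEFAULT_MAINTENANCE)
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(resolved.get(key), dict):
            resolved[key] = {**resolved[key], **value}
        else:
            resolved[key] = value
    return resolved


# Hypothetical override, as it might look once parsed from a
# datasets.yaml entry for our table.
wdqs_overrides = {"expire_snapshots": {"older_than_days": 30}}
wdqs_settings = resolve_maintenance(wdqs_overrides)
```

In this sketch a dataset with no overrides simply inherits the defaults, which matches what this MR relies on for now.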
Bug: T418723