Run Dumps 2.0 main DAG at a daily cadence rather than hourly.
In this MR we:
- Incorporate repos/data-engineering/dumps/mediawiki-content-dump!41 (merged)
- Run Dumps 2.0 main DAG at a daily cadence rather than hourly. Rename it to `dumps_merge_events_to_wikitext_raw_daily_dag`.
- Run table maintenance for `wikitext_raw` weekly instead of daily, now that we will have far fewer commits.
- Set `spark.sql.iceberg.locality.enabled = true` now that we are not in a time crunch and can afford query planning that takes multiple minutes.
- Since we now consume daily, make `dumps_reconcile_wikitext_raw_daily_dag` wait on the main DAG, and thus avoid reporting inconsistencies that have simply not had the chance to be ingested yet.
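
For reviewers, the changes listed above can be pictured with a rough, dependency-free sketch. It is not the code in this MR: the `SETTINGS` keys, the `@daily` / `@weekly` schedule strings and the `reconcile_may_run` helper are all hypothetical names chosen for illustration. Only the DAG id and the Spark property come from this description.

```python
from datetime import date
from typing import Set

# Hypothetical summary of what this MR changes; key names are illustrative only.
SETTINGS = {
    "merge_dag_id": "dumps_merge_events_to_wikitext_raw_daily_dag",
    "merge_schedule": "@daily",        # assumed preset; was hourly before
    "maintenance_schedule": "@weekly", # assumed preset; was daily before
    "spark_conf": {"spark.sql.iceberg.locality.enabled": "true"},
}


def reconcile_may_run(day: date, merged_days: Set[date]) -> bool:
    """Return True only once the daily merge for `day` has succeeded.

    This models the "wait on the main DAG" rule: reconciling a day that
    has not been ingested yet would report spurious inconsistencies.
    """
    return day in merged_days


if __name__ == "__main__":
    merged = {date(2024, 1, 1)}
    print(reconcile_may_run(date(2024, 1, 1), merged))
    print(reconcile_may_run(date(2024, 1, 2), merged))
```

In Airflow, this kind of gating is commonly expressed with a sensor task (for example `ExternalTaskSensor`) at the start of the downstream DAG; the sketch only captures the rule, not the mechanism used here.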
Bug: T377999
Edited by Xcollazo