Run Dumps 2.0 main DAG at a daily cadence rather than hourly.

Xcollazo requested to merge do-daily-runs-instead-of-hourly into main

In this MR we:

  • Incorporate repos/data-engineering/dumps/mediawiki-content-dump!41 (merged)
  • Run Dumps 2.0 main DAG at a daily cadence rather than hourly. Rename it to dumps_merge_events_to_wikitext_raw_daily_dag.
  • Run table maintenance for wikitext_raw weekly instead of daily, now that we will have far fewer commits.
  • Set spark.sql.iceberg.locality.enabled = true now that we are not in a time crunch and can afford query planning that takes multiple minutes.
  • Since we now consume daily, make dumps_reconcile_wikitext_raw_daily_dag wait on the daily merge DAG, so that it does not report inconsistencies for events that have simply not yet had the chance to be ingested.
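
As a rough sanity check on the maintenance cadence change, the arithmetic can be sketched as below. This is only an illustration, not code from this MR: the helper name commits_between_maintenance is hypothetical, and it assumes each DAG run produces exactly one commit on wikitext_raw, which may not hold in practice.

```python
from datetime import timedelta


def commits_between_maintenance(
    run_interval: timedelta, maintenance_interval: timedelta
) -> int:
    """Count ingestion commits that accumulate between two maintenance runs.

    Assumption (not from the MR): every DAG run yields exactly one commit.
    """
    # timedelta // timedelta performs floor division and returns an int.
    return maintenance_interval // run_interval


# Before this MR: hourly ingestion, daily table maintenance.
before = commits_between_maintenance(timedelta(hours=1), timedelta(days=1))

# After this MR: daily ingestion, weekly table maintenance.
after = commits_between_maintenance(timedelta(days=1), timedelta(weeks=1))

print(f"commits per maintenance window: before={before}, after={after}")
```

Under that one-commit-per-run assumption, each weekly maintenance window would see fewer commits than each daily window did under the hourly cadence, which is consistent with relaxing maintenance to weekly.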

Bug: T377999

Edited by Xcollazo