Detect inconsistent rows from wmf_dumps.wikimedia_raw
In this MR we implement a DAG that runs, once a day, the PySpark job from repos/data-engineering/dumps/mediawiki-content-dump!30 (merged).
This DAG closely follows the pattern from !628 (merged) for leveraging dynamic task mapping.
For now, the DAG only has 2 steps:
- Fetch a dblist to create dynamic tasks.
- Run the PySpark job that, for each wiki from that dblist, detects inconsistent (wiki_db, revision_id) tuples.
A later revision of this DAG will also trigger a reconciliation mechanism that will consume the results of step 2.
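For reviewers less familiar with dynamic task mapping, here is a minimal, Airflow-free sketch of the fan-out idea behind the two steps: turn dblist text into a list of wikis, then build one argument set per wiki. This is an illustration only, not the code in this MR. The helper names `parse_dblist` and `build_mapped_args`, the `--wiki_db` and `--date` flags, and the sample dblist content are all hypothetical; the sketch assumes a dblist is plain text with one database name per line and `#` comments.

```python
from typing import Dict, List


def parse_dblist(text: str) -> List[str]:
    """Return wiki database names from dblist-style text (one name per line)."""
    wikis = []
    for line in text.splitlines():
        # Drop any trailing '#' comment and surrounding whitespace.
        name = line.split("#", 1)[0].strip()
        if name:
            wikis.append(name)
    return wikis


def build_mapped_args(wikis: List[str], run_date: str) -> List[Dict[str, List[str]]]:
    """Build one argument set per wiki; each set would back one mapped task."""
    # The flag names below are hypothetical, not the real job's CLI.
    return [
        {"application_args": ["--wiki_db", wiki, "--date", run_date]}
        for wiki in wikis
    ]


# Hypothetical dblist content, for illustration only.
sample = """# hypothetical dblist content
enwiki
simplewiki  # inline comment

wikidatawiki
"""

wikis = parse_dblist(sample)
mapped_args = build_mapped_args(wikis, "2024-07-01")
print(len(mapped_args), mapped_args[0])
```

In Airflow, a list shaped like `mapped_args` is the kind of input that `.partial(...).expand_kwargs(...)` on an operator accepts, producing one mapped task instance per element. Which operator the DAG actually uses, and its real arguments, are as in the diff.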
I'd like to get this DAG to prod to start testing a full run of the ~900 wikis.
Bug: T368756