
Detect inconsistent rows from wmf_dumps.wikimedia_raw

Xcollazo requested to merge emit-mismatch-rows into main

In this MR we implement a DAG that runs, once a day, the PySpark job from repos/data-engineering/dumps/mediawiki-content-dump!30 (merged).

This DAG closely follows the pattern from !628 (merged) for leveraging dynamic task mapping.

For now, the DAG has only two steps:

  1. Fetch a dblist to create dynamic tasks.
  2. Run the PySpark job that, for each wiki from that dblist, detects inconsistent (wiki_db, revision_id) tuples.
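
As a rough illustration of how the two steps connect, here is a minimal pure-Python sketch (not the code in this MR) of turning a dblist into one argument set per wiki, which is the shape dynamic task mapping expects. It assumes the dblist is plain text with one database name per line and `#` comments; the function names and the `wiki_db` / `day` argument keys are hypothetical.

```python
from typing import Dict, List


def parse_dblist(text: str) -> List[str]:
    """Step 1: return wiki database names, skipping blank lines and '#' comments.

    Assumption: one database name per line, '#' starts a comment.
    """
    wikis = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            wikis.append(name)
    return wikis


def build_mapped_args(wikis: List[str], day: str) -> List[Dict[str, str]]:
    """Step 2 input: one argument dict per wiki, i.e. one mapped task each.

    The keys 'wiki_db' and 'day' are hypothetical, chosen for illustration.
    """
    return [{"wiki_db": wiki, "day": day} for wiki in wikis]


if __name__ == "__main__":
    sample = "# hypothetical dblist\nenwiki\n\nsimplewiki\n"
    for args in build_mapped_args(parse_dblist(sample), "2024-07-01"):
        print(args)
```

In Airflow, a list like this would typically be handed to an operator through `.partial(...).expand(...)` or `.expand_kwargs(...)`, so that each wiki gets its own mapped task instance and a failure on one wiki can be retried without rerunning the others.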

A later revision of this DAG will also trigger a reconciliation mechanism that will consume the results of (2).

I'd like to get this DAG to prod to start testing a full run of the ~900 wikis.

Bug: T368756

Edited by Xcollazo
