
Implementation of data drift checks.

Xcollazo requested to merge implement-first-set-of-data-quality-checks-two into main

In this MR we implement a PySpark job that runs three data drift checks (a sketch of the shared pattern follows the list):

  • Calculate the last N revisions from a MariaDB replica (say, enwiki) whose visibility has been suppressed. Check whether these suppressions are reflected in the data lake table, and print a summary (example: 99.999% match).
  • Calculate the last N revisions from a MariaDB replica. Check whether these revisions' sha1 and length match the data lake table. Print a summary.
  • Calculate the revision count of the last N page_ids that have recently been revised. Check whether the revision count matches the data lake table. Print a summary.
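
For illustration, here is a minimal PySpark sketch of the join-and-compare pattern behind these checks, shown for the sha1/length check. The JDBC URL, the data lake table name (`wmf_dumps.wikitext_raw`), its column names, and the value of N are assumptions for this sketch, not the actual values used in the MR:

```python
# Hypothetical sketch of the join-and-compare pattern behind these checks.
# Table names, column names, the JDBC URL, and N are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-drift-checks").getOrCreate()

N = 100_000  # assumed sample size of recent revisions

# Pull the last N revisions from the MariaDB replica over JDBC.
recent_revs = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://enwiki-replica.example:3306/enwiki")  # assumed URL
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option(
        "dbtable",
        f"(SELECT rev_id, rev_sha1, rev_len FROM revision "
        f"ORDER BY rev_id DESC LIMIT {N}) AS recent_revs",
    )
    .load()
)

# Assumed data lake table and column names.
lake_revs = spark.table("wmf_dumps.wikitext_raw").select(
    "revision_id", "revision_sha1", "revision_size"
)

# Check 2: for each sampled rev_id, do sha1 and length agree with the lake?
# (Assumes both sides store sha1 in the same encoding.)
joined = recent_revs.join(
    lake_revs, recent_revs.rev_id == lake_revs.revision_id, "left"
)
total = recent_revs.count()
matching = joined.filter(
    (F.col("rev_sha1") == F.col("revision_sha1"))
    & (F.col("rev_len") == F.col("revision_size"))
).count()
print(f"sha1/length drift check: {matching / total:.3%} match ({matching}/{total})")
```

Checks 1 and 3 would follow the same shape: sample the replica, join to the data lake table on rev_id (or aggregate per page_id and compare revision counts), and print the match rate.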

The idea is to eventually use these metrics as a gating mechanism for cutting a dump. For now, we don't persist these calculations; we just print them. Later, we will evaluate whether it makes sense to keep these metrics in wmf_data_ops.data_quality_metrics.

Bug: T354761

