
Implementation of data drift checks.

Xcollazo requested to merge implement-first-set-of-data-quality-checks-two into main

In this MR we implement a PySpark job that runs three data drift checks (a sketch of the shared pattern follows the list):

  • Calculate the last N revisions from a MariaDB replica (say, enwiki) whose visibility has been suppressed. Check whether these suppressions are reflected in the data lake table, and print a summary (example: 99.999% match).
  • Calculate the last N revisions from a MariaDB replica. Check whether these revisions' sha1 and length match the data lake table. Print a summary.
  • Calculate the revision count of the last N page_ids that have recently been revised. Check whether the revision count matches the data lake table. Print a summary.
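
For illustration, here is a minimal PySpark sketch of the join-and-compare pattern behind these checks, shown for the sha1/length check. The JDBC URL, the data lake table name (`wmf_dumps.wikitext_raw`), its column names, and the value of N are assumptions for this sketch, not the actual values used in the MR:

```python
# Hypothetical sketch of the join-and-compare pattern behind these checks.
# Table names, column names, the JDBC URL, and N are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-drift-checks").getOrCreate()

N = 100_000  # assumed sample size of recent revisions

# Pull the last N revisions from the MariaDB replica over JDBC.
recent_revs = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://enwiki-replica.example:3306/enwiki")  # assumed URL
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option(
        "dbtable",
        f"(SELECT rev_id, rev_sha1, rev_len FROM revision "
        f"ORDER BY rev_id DESC LIMIT {N}) AS recent_revs",
    )
    .load()
)

# Assumed data lake table and column names.
lake_revs = spark.table("wmf_dumps.wikitext_raw").select(
    "revision_id", "revision_sha1", "revision_size"
)

# Check 2: for each sampled rev_id, do sha1 and length agree with the lake?
# (Assumes both sides store sha1 in the same encoding.)
joined = recent_revs.join(
    lake_revs, recent_revs.rev_id == lake_revs.revision_id, "left"
)
total = recent_revs.count()
matching = joined.filter(
    (F.col("rev_sha1") == F.col("revision_sha1"))
    & (F.col("rev_len") == F.col("revision_size"))
).count()
print(f"sha1/length drift check: {matching / total:.3%} match ({matching}/{total})")
```

Checks 1 and 3 would follow the same shape: sample the replica, join to the data lake table on rev_id (or aggregate per page_id and compare revision counts), and print the match rate.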

The idea is to eventually use these metrics as a gating mechanism for cutting a dump. For now, we don't persist these calculations; we just print them. Later, we will evaluate whether it makes sense to keep these metrics in wmf_data_ops.data_quality_metrics.

Bug: T354761

