Implementation of data drift checks.
In this MR we implement a PySpark job that runs 3 data drift checks:
- Calculate the last N revisions from a MariaDB replica (say, `enwiki`) that have had their visibility suppressed. Check on the data lake table whether these suppressions are reflected, and print a summary (example: 99.999% match).
- Calculate the last N revisions from a MariaDB replica. Check on the data lake table whether these revisions' `sha1` and `length` match. Print a summary (a sketch of this check follows the list).
- Calculate the revision count of the last N page_ids that have been recently revised. Check on the data lake table whether the revision count matches. Print a summary.
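As an illustration, a minimal PySpark sketch of the second check could look something like the following. The JDBC URL and credentials are placeholders, and the data lake table and column names (`wmf.mediawiki_history`, `revision_text_sha1`, `revision_text_bytes`) are assumptions for the sake of the example, not necessarily what the job uses:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("revision-drift-check").getOrCreate()

N = 100_000  # how many recent revisions to compare

# Pull the last N revisions from the MariaDB replica over JDBC.
# URL and credentials are placeholders.
recent_revs = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://enwiki-replica.example:3306/enwiki")
    .option(
        "dbtable",
        f"(SELECT rev_id, rev_sha1, rev_len FROM revision "
        f"ORDER BY rev_id DESC LIMIT {N}) AS recent_revs",
    )
    .option("user", "...")
    .option("password", "...")
    .load()
)

# Read the matching rows from the data lake; table and column names
# here are assumed for illustration.
lake_revs = (
    spark.table("wmf.mediawiki_history")
    .where(F.col("wiki_db") == "enwiki")
    .select("revision_id", "revision_text_sha1", "revision_text_bytes")
)

# Left join on revision id, then count rows where both sha1 and
# length agree between the replica and the data lake.
joined = recent_revs.join(
    lake_revs, recent_revs.rev_id == lake_revs.revision_id, "left"
)
total = joined.count()
matched = joined.where(
    (F.col("rev_sha1") == F.col("revision_text_sha1"))
    & (F.col("rev_len") == F.col("revision_text_bytes"))
).count()

# For now we only print the summary instead of persisting it.
pct = 100.0 * matched / total if total else 0.0
print(f"sha1/length match: {matched}/{total} ({pct:.3f}%)")
```

Pushing the `ORDER BY ... LIMIT` into the JDBC subquery keeps the replica query cheap: only the N rows we actually compare leave MariaDB.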
The idea is to eventually use these metrics as a gating mechanism for cutting a dump. For now, we don't persist these calculations; we just print them. Later, we will see whether it makes sense to keep these metrics on `wmf_data_ops.data_quality_metrics`.
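If we do persist them later, appending one row per check would be enough. Continuing the sketch above (reusing `spark` and `pct`), with a purely hypothetical schema for that table:

```python
# Hypothetical persistence of one summary row; the column names are
# assumptions, not the actual data_quality_metrics schema.
metrics = spark.createDataFrame(
    [("enwiki", "revision_sha1_length_match_pct", pct)],
    ["wiki_db", "metric_name", "metric_value"],
)
metrics.writeTo("wmf_data_ops.data_quality_metrics").append()
```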
Bug: T354761