Add quality and drift data analysis
This MR adds a Jupyter notebook with data analysis in support of T340831.
Key takeaways:
- page_change_v1 shows stable throughput over time. We should be able to characterize "normal" traffic.
- rc1_mediawiki_page_content_change contains spurious data (pipeline re-runs, duplicate events) that skews statistics.
- processed time vs event time drift seems consistent across both datasets. This issue is not specific to these datasets, but should be investigated upstream.
Metrics we should consider for alerting on data quality regressions:
- absolute number of processed events (consumed, produced, produced vs consumed).
- rate of change (day to day) of processed events (consumed, produced, produced vs consumed).
- day-to-day variation in rate of change of processed events (consumed, produced, produced vs consumed).
- processed time vs event time drift (number of events with drift > 1 day per period).
- error type distribution over time (TBD, not enough data).
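
A minimal sketch of how the count and drift metrics above could be computed. It is not taken from the notebook: the function names, the sample daily counts, and the (event time, processed time) tuple layout are all hypothetical and only illustrate the definitions.

```python
from datetime import datetime, timedelta


def daily_rate_of_change(counts):
    """Relative day-to-day change: (today - yesterday) / yesterday."""
    return [
        (cur - prev) / prev if prev else float("nan")
        for prev, cur in zip(counts, counts[1:])
    ]


def variation_in_rate_of_change(rates):
    """Difference between consecutive day-to-day rates of change."""
    return [cur - prev for prev, cur in zip(rates, rates[1:])]


def count_drifted(events, threshold=timedelta(days=1)):
    """Number of events whose processed time lags event time by more than threshold."""
    return sum(
        1
        for event_time, processed_time in events
        if processed_time - event_time > threshold
    )


# Hypothetical daily event counts (one entry per day).
consumed = [1000, 1100, 990]
produced = [1000, 1045, 990]

consumed_rates = daily_rate_of_change(consumed)
consumed_variation = variation_in_rate_of_change(consumed_rates)
produced_vs_consumed = [p / c for p, c in zip(produced, consumed)]

# Hypothetical (event_time, processed_time) pairs for one period.
events = [
    (datetime(2023, 7, 1, 12, 0), datetime(2023, 7, 1, 12, 5)),  # 5 minutes of drift
    (datetime(2023, 7, 1, 12, 0), datetime(2023, 7, 3, 0, 0)),   # 1.5 days of drift
    (datetime(2023, 7, 2, 0, 0), datetime(2023, 7, 3, 0, 0)),    # exactly 1 day, not counted
]
drifted = count_drifted(events)
```

An alert could then compare each series against thresholds derived from the "normal" traffic observed in page_change_v1; the thresholds themselves would still need to be chosen from the data.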
cc / @xcollazo @milimetric @aqu @joal
Bug: T340831
Bug: T341134