
Add quality and drift data analysis

Gmodena requested to merge T340831-add-data-analysis into main

This MR adds a Jupyter notebook with data analysis in support of T340831.

Key takeaways:

  • page_change_v1 shows stable throughput over time. We should be able to characterize "normal" traffic.
  • rc1_mediawiki_page_content_change contains spurious data (pipeline re-runs, duplicate events) that skews statistics.
  • processed time vs event time drift seems consistent across both datasets. This issue is not specific to these datasets, and should be investigated upstream.
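
To illustrate the second point, here is a minimal sketch of how duplicate events could be counted and dropped before computing statistics. It is not taken from the notebook: the `event_id` field and the record layout are assumptions made for the example, and the real dataset may need a different deduplication key.

```python
def deduplicate(events, key="event_id"):
    """Drop repeated events, keeping the first occurrence of each key.

    `events` is an iterable of dicts; `key` names a hypothetical field
    that uniquely identifies an event (an assumption for this sketch).
    Returns the unique events and the number of duplicates dropped.
    """
    seen = set()
    unique = []
    duplicates = 0
    for event in events:
        identifier = event[key]
        if identifier in seen:
            duplicates += 1
            continue
        seen.add(identifier)
        unique.append(event)
    return unique, duplicates


# Made-up sample: event "a" appears twice, e.g. after a pipeline re-run.
sample = [{"event_id": "a"}, {"event_id": "b"}, {"event_id": "a"}]
unique_events, dropped = deduplicate(sample)
```

Comparing statistics before and after such a step would show how much the re-runs and duplicates skew them.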

Metrics we should consider for alerting on data quality regressions:

  • absolute number of processed events (consumed, produced, produced vs consumed).
  • rate of change (day to day) of processed events (consumed, produced, produced vs consumed).
  • day-to-day variation in rate of change of processed events (consumed, produced, produced vs consumed).
  • processed time vs event time drift (number of events with drift > 1 day per period).
  • error type distribution over time (TBD, not enough data).
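
The count-based and drift metrics above could be computed roughly as follows. This is a sketch under stated assumptions, not the notebook's code: the function names are hypothetical, events are assumed to be available as Python `datetime` values, and "period" is taken to be a calendar day of processing time.

```python
from collections import Counter
from datetime import datetime, timedelta


def daily_counts(timestamps):
    """Absolute number of events per calendar day."""
    return Counter(ts.date() for ts in timestamps)


def rate_of_change(counts):
    """Relative change in count between consecutive observed days.

    Keyed by the later day. Days with no events are absent from `counts`,
    so consecutive observed days are not necessarily calendar-adjacent.
    """
    days = sorted(counts)
    return {
        curr: (counts[curr] - counts[prev]) / counts[prev]
        for prev, curr in zip(days, days[1:])
        if counts[prev] > 0
    }


def variation_in_rate(rates):
    """Difference between consecutive rate-of-change values."""
    days = sorted(rates)
    return {curr: rates[curr] - rates[prev] for prev, curr in zip(days, days[1:])}


def drifted_events_per_day(events, threshold=timedelta(days=1)):
    """Count events whose processed time lags event time by more than `threshold`.

    `events` is an iterable of (event_time, processed_time) pairs,
    grouped here by the day on which the event was processed.
    """
    return Counter(
        processed.date()
        for event_time, processed in events
        if processed - event_time > threshold
    )


# Made-up sample: one event processed two days late, one processed within the hour.
sample_events = [
    (datetime(2023, 7, 1, 0, 0), datetime(2023, 7, 3, 0, 0)),
    (datetime(2023, 7, 1, 0, 0), datetime(2023, 7, 1, 1, 0)),
]
late = drifted_events_per_day(sample_events)
```

The same functions can be applied separately to consumed and produced events; the produced vs consumed comparison is then a per-day ratio or difference of the two `daily_counts` results.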

cc / @xcollazo @milimetric @aqu @joal

Bug: T340831

Bug: T341134

Edited by Gmodena
