
Automoderator monthly snapshot DAG

Bug: T375480

Fetches a monthly snapshot of Automoderator's activity from mediawiki_history.

Steps:

  • waits for the mediawiki_history snapshot to be released.
  • fetches and processes Automoderator-related activity on the wikis where it is deployed, based on config.
  • appends each snapshot to wmf_product.automoderator_activity_snapshot_monthly.
  • publishes the latest snapshot as a TSV to https://analytics.wikimedia.org/published/datasets/
  • purges snapshots older than 3 months from the destination table.
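To make the purge step concrete for reviewers, here is a minimal sketch of how the retention cutoff could be computed. It is not taken from the DAG code: the function names are hypothetical, snapshot labels are assumed to be `YYYY-MM` strings, and "older than 3 months" is assumed to mean strictly before the month that is 3 months prior to the latest snapshot.

```python
from datetime import date


def shift_months(d: date, months: int) -> date:
    """Return the first day of the month that is `months` away from d's month."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def snapshots_to_purge(existing, latest, keep_months=3):
    """List the snapshot labels that fall before the retention cutoff.

    `existing` and `latest` are assumed to be 'YYYY-MM' labels (hypothetical
    format). A snapshot exactly `keep_months` old is kept, not purged.
    """
    year, month = map(int, latest.split("-"))
    cutoff = shift_months(date(year, month, 1), -keep_months)
    cutoff_label = f"{cutoff.year:04d}-{cutoff.month:02d}"
    # Zero-padded 'YYYY-MM' labels sort chronologically as plain strings.
    return sorted(s for s in existing if s < cutoff_label)


if __name__ == "__main__":
    months = ["2024-04", "2024-05", "2024-06", "2024-07", "2024-08", "2024-09"]
    print(snapshots_to_purge(months, latest="2024-09"))
```

If the intended boundary is different (for example, keeping exactly three snapshots), only the comparison against `cutoff_label` would need to change.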

Results from testing:

Dev instance test result

Screenshot from 2024-10-04 10-22-43.png

Verified the results using:

sudo -u analytics-privatedata spark3-sql -e "select wiki_db, count(*) from kcvelaga.automoderator_activity_snapshot_monthly group by wiki_db"

The final TSV file is at:

/tmp/kcvelaga/automoderator/monthly_snaphot_archive/snapshot.tsv.bz2
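For anyone who wants to cross-check the published file against the per-wiki counts from the Spark query, a small sketch along these lines may help. It assumes the TSV has a header row that includes a `wiki_db` column, which is not confirmed here; the function name is hypothetical.

```python
import bz2
import csv
from collections import Counter


def rows_per_wiki(path, column="wiki_db"):
    """Count rows per wiki in a bz2-compressed TSV.

    Assumes the first line is a header containing `column` (an assumption
    about the file layout, not something verified against the DAG output).
    """
    counts = Counter()
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            counts[row[column]] += 1
    return counts
```

Calling `rows_per_wiki` with the path above should give counts comparable to the `group by wiki_db` query, provided the file holds the same snapshot as the table.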

Note: I marked this as a draft because the queries MR (repos/product-analytics/data-pipelines!24 (merged)) is currently under review. The DAG code can be reviewed in the meantime, and merged once both the queries and the DAG code have been reviewed.

Edited by KCVelaga
