Automoderator monthly snapshot DAG
Bug: T375480
Fetches a monthly snapshot of Automoderator's activity from mediawiki_history.
Steps:
- waits for the mediawiki_history snapshot to be released.
- fetches and processes Automoderator-related activity on the wikis where it is deployed, based on config.
- each snapshot is appended to wmf_product.automoderator_activity_snapshot_monthly
- the latest snapshot is published as TSV to https://analytics.wikimedia.org/published/datasets/
- snapshots older than 3 months are purged from the destination table.
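For reviewers, the purge step above can be sketched as follows. This is only an illustration, not the DAG's actual code: it assumes snapshots are labelled `YYYY-MM` (as mediawiki_history snapshots are), that "older than 3 months" means strictly more than 3 months before the current snapshot, and the helper names `months_back` and `snapshots_to_purge` are hypothetical.

```python
def months_back(snapshot: str, months: int) -> str:
    """Return the 'YYYY-MM' label that lies `months` months before `snapshot`."""
    year, month = (int(part) for part in snapshot.split("-"))
    # Count months from year 0 so that crossing a year boundary is plain arithmetic.
    index = year * 12 + (month - 1) - months
    return f"{index // 12:04d}-{index % 12 + 1:02d}"


def snapshots_to_purge(existing, current, keep_months=3):
    """List the snapshot labels that are more than `keep_months` months older than `current`.

    Assumption: zero-padded 'YYYY-MM' labels, so string comparison
    matches chronological order.
    """
    cutoff = months_back(current, keep_months)
    return sorted(label for label in existing if label < cutoff)


if __name__ == "__main__":
    present = ["2024-05", "2024-06", "2024-07", "2024-08", "2024-09", "2024-10"]
    # With 2024-10 as the current snapshot, the cutoff is 2024-07.
    print(snapshots_to_purge(present, "2024-10"))
```

Under these assumptions, a run for snapshot 2024-10 keeps 2024-07 through 2024-10 and drops everything earlier. If the intended retention is exactly three snapshots, the comparison would need to be `<=` instead.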
Results from testing:
Dev instance test result
Verified the results using:
sudo -u analytics-privatedata spark3-sql -e "select wiki_db, count(*) from kcvelaga.automoderator_activity_snapshot_monthly group by wiki_db"
The final TSV file is at:
/tmp/kcvelaga/automoderator/monthly_snaphot_archive/snapshot.tsv.bz2
Note: I marked this as a draft because the queries MR (repos/product-analytics/data-pipelines!24 (merged)) is currently under review. The DAG code can be reviewed in the meantime, and merged once the queries have been reviewed.