
Implement pipeline for `mediawiki_content_current_v1`

Xcollazo requested to merge add-mw-content-current into main

In this MR we:

  • Define the DDL for `wmf_content.mediawiki_content_current_v1`. The source table, `wmf_content.mediawiki_content_history_v1`, has 3 control columns; for this table we only need one, `row_update_dt`, since we consume changes from a single source. See the control column's usage in the code for details on why it is needed. A sketch of the DDL follows this list.
  • Implement a MERGE INTO PySpark job that gathers per-page changes from the source table and applies them to the target (see the sketch after this list). We tried a few tricks, like DataFrame caching to accelerate change detection, but found little or no benefit while making the code more complicated, so we opted to eliminate that code.
  • Where we did find benefit is in doing a BROADCAST of the detected changes during the full table scan on the source table. For this, we include a flag to toggle between BROADCAST and MERGE join strategies. The calculated changes DataFrame holds ~1M changes (~200 MB) for a typical daily ingestion, but we want the option of a MERGE join in case we need to catch up multiple days, and for backfills.
  • As per Spike T366544, the MERGE statement requires no state and is self-healing: if a daily run fails, we can ignore it and rerun later, and it will pick up all the changes.
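
As a reference, here is a minimal sketch of what the target table's DDL could look like, assuming an Iceberg table; the content columns shown are illustrative placeholders (the real column list is defined in the DDL file in this MR), and only `row_update_dt` is the control column kept from the history table:

```python
# Hedged sketch, not the actual DDL in this MR: assumes an Iceberg table and
# uses illustrative content columns. Only row_update_dt is the control column.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS wmf_content.mediawiki_content_current_v1 (
        wiki_id          STRING,
        page_id          BIGINT,
        revision_id      BIGINT,
        revision_content STRING,
        row_update_dt    TIMESTAMP  -- when this row was last touched by the pipeline
    )
    USING iceberg
    PARTITIONED BY (wiki_id)
""")
```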

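Below is a hedged sketch of the MERGE INTO step and the join-strategy toggle described above. The column names (`wiki_id`, `page_id`), the `is_delete` marker on the changes DataFrame, the `use_broadcast` flag, and the `changes` temp view name are illustrative assumptions; the real job in this MR first computes the changes DataFrame from the history table:

```python
# Hedged sketch, not the MR's actual job. Column names, the is_delete marker,
# and the use_broadcast flag are illustrative assumptions.
from pyspark.sql import DataFrame, SparkSession


def apply_changes(spark: SparkSession, changes_df: DataFrame, use_broadcast: bool) -> None:
    """Merge the per-page changes computed from the history table into the target."""
    # Toggle between BROADCAST and (sort-)MERGE join strategies via a join hint.
    # The ~1M-row / ~200 MB daily change set is small enough to broadcast; for
    # multi-day catch-ups and backfills we fall back to a merge join.
    hinted = changes_df.hint("broadcast" if use_broadcast else "merge")
    hinted.createOrReplaceTempView("changes")

    spark.sql("""
        MERGE INTO wmf_content.mediawiki_content_current_v1 AS target
        USING changes AS source
        ON  target.wiki_id = source.wiki_id
        AND target.page_id = source.page_id
        WHEN MATCHED AND source.is_delete THEN DELETE
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED AND NOT source.is_delete THEN INSERT *
    """)
```

Because the MERGE keys on the page and the changes are recomputed from the source on every run, rerunning a failed day converges to the same result, which is what makes the job stateless and self-healing.
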
Note this pipeline takes ~1.6h with locality off, and ~2.1h with locality on (yes, a whole 30 minutes doing query planning when pulling HDFS location info from the headnode!). We include a flag for this so it can be tested in prod.
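
For reference, one way such a locality toggle could be plumbed through when scanning the source table, assuming it is an Iceberg table and using Iceberg's `locality` Spark read option; the function and flag names here are illustrative, not the MR's actual code:

```python
# Hedged sketch: function name and flag are illustrative. Assumes the source is
# an Iceberg table; `locality` is Iceberg's Spark read option that controls
# whether HDFS block locations are fetched (and reported to Spark) at planning time.
from pyspark.sql import DataFrame, SparkSession


def read_history(spark: SparkSession, enable_locality: bool) -> DataFrame:
    return (
        spark.read
             .option("locality", "true" if enable_locality else "false")
             .table("wmf_content.mediawiki_content_history_v1")
    )
```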

Bug: T391282

Bug: T391283
