Implement pipeline for `mediawiki_content_current_v1`
In this MR we:
- Define the DDL for `wmf_content.mediawiki_content_current_v1`. The source table, `wmf_content.mediawiki_content_history_v1`, has 3 control columns. For this table we only need one control column, `row_update_dt`, as we consume changes from that single source. See the control column's usage in the code for details on why it is needed (a sketch follows this list).
- Implement a MERGE INTO PySpark job that gathers per-page changes from the source table and applies them to the target. We tried a few tricks, like dataframe caching to accelerate change detection, but found little or no benefit while the code got more complicated, so we removed that code.
- Where we did find a benefit is in doing a BROADCAST of the detected changes when doing the full table scan of the source table. For this, we include a flag to toggle between BROADCAST and MERGE join strategies (see the sketch after this list). The calculated changes dataframe is ~1M changes (~200 MB) for a typical daily ingestion, but we want the option of a MERGE join in case we need to catch up multiple days, and for backfills.
- As per Spike T366544, the MERGE statement requires no state and is self-healing: if a daily run fails, we can ignore it and rerun later, and we will pick up all changes.
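
To make the control column's role concrete, here is a minimal sketch of the change-detection and MERGE step, assuming Iceberg-style Spark SQL. The join keys (`wiki_id`, `page_id`), the `revision_id` ordering, and the reuse of `row_update_dt` as the source-side filter are assumptions, and delete handling is omitted; the actual job added in this MR is the source of truth.

```python
# Illustrative sketch only -- column names other than row_update_dt and the
# two table names are assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("mediawiki_content_current_merge").getOrCreate()

# The single control column on the target records how far we have consumed the
# source: anything newer on the source side still needs to be applied.
# (Bootstrapping an empty target is out of scope for this sketch.)
last_applied = (
    spark.table("wmf_content.mediawiki_content_current_v1")
    .agg(F.max("row_update_dt").alias("dt"))
    .first()["dt"]
)

# Assumption: the source exposes a comparable change timestamp; keep only the
# latest change per page so the MERGE sees one row per (wiki_id, page_id).
latest_per_page = Window.partitionBy("wiki_id", "page_id").orderBy(F.col("revision_id").desc())
changes = (
    spark.table("wmf_content.mediawiki_content_history_v1")
    .where(F.col("row_update_dt") > F.lit(last_applied))
    .withColumn("rn", F.row_number().over(latest_per_page))
    .where(F.col("rn") == 1)
    .drop("rn")
)
changes.createOrReplaceTempView("changes")

# MERGE INTO is stateless and idempotent: a failed daily run can simply be
# rerun later, since last_applied is re-derived from the target itself.
spark.sql("""
    MERGE INTO wmf_content.mediawiki_content_current_v1 t
    USING changes c
      ON t.wiki_id = c.wiki_id AND t.page_id = c.page_id   -- assumed join keys
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```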
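And a sketch of the join-strategy toggle; the flag and function names are hypothetical, the mechanism is just Spark's `broadcast` vs. `merge` (sort-merge) join hints applied to the changes dataframe before it is joined against the full source scan.

```python
# Sketch: `use_broadcast_join` is a hypothetical pipeline flag.
def apply_join_strategy(changes_df, use_broadcast_join: bool):
    """Hint the strategy for joining the changes against the full source scan.

    A typical daily changes dataframe is ~1M rows / ~200 MB, so broadcasting it
    is cheap. For multi-day catch-ups or backfills it can grow well beyond a
    broadcastable size, so we fall back to a sort-merge join.
    """
    return changes_df.hint("broadcast" if use_broadcast_join else "merge")
```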
Note this pipeline takes ~1.6h with locality off and ~2.1h with locality on (yes, a whole 30 minutes of query planning just pulling HDFS location info from the headnode!). We have a flag for this so it can be tested in prod (see the sketch below).
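
For the locality flag, a sketch of one way it could be wired, assuming it maps to Spark's internal `spark.sql.sources.ignoreDataLocality` setting (Spark 3), which skips fetching HDFS block locations during file listing; the flag name and the exact Spark conf used in the job are assumptions.

```python
# Hypothetical wiring of the locality toggle; `ignore_data_locality` is an
# assumed pipeline flag, and the underlying conf is internal to Spark 3.
def configure_locality(spark, ignore_data_locality: bool):
    # When enabled, Spark skips pulling HDFS block locations from the headnode
    # while listing files, which is where the extra ~30 min of planning went.
    spark.conf.set(
        "spark.sql.sources.ignoreDataLocality",
        "true" if ignore_data_locality else "false",
    )
```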
Bug: T391282
Bug: T391283