Refactor `wikitext_raw` table to support backfilling

Xcollazo requested to merge optimize-writes into main

In this MR we change the schema of wikitext_raw_rc0 to wikitext_raw_rc1 like so:

  • Change the partitioning strategy from hours(revision_timestamp) to (wiki_db, days(revision_timestamp)).
    • The rationale for days(revision_timestamp) is that this strategy generates far fewer ORed predicates that we need to push down when doing the MERGE INTO. It will also help contain the number of files in HDFS once we run CALL spark_catalog.system.rewrite_data_files() on the table.
    • The rationale for adding a wiki_db partition is to aid the backfilling process. This process touches all days(revision_timestamp) partitions and thus we need a separate mechanism that pushes down the wiki_db in order to make the backfill manageable. This way we can ingest in wiki_db groupings.
    • Since partitioning keys are orthogonal in Iceberg, this strategy, so far, gives us a good ingestion compromise.
  • Switch from parquet to avro. After discussions with the team, we figured this is safer given that content_slots contain full revisions.
  • Flatten out the schema of the target table. We now include only the fields we believe are necessary to produce a dump, and nothing else.
  • We introduce a helper TIMESTAMP column called row_last_updated. The idea is that it will serve as a watermark that we bump every time we touch a particular row.
    • For streaming ingests, we will update it with meta.dt (time the event was received by the system).
    • For backfills, we will update it with the backfilling table's 'freshness date', which in the case of wmf.mediawiki_wikitext_history is the snapshot (the dumps 1.0 release date).
    • Note that, for both stream ingests and backfills, if the target row already has more recent data (i.e., a higher watermark), we ignore the update.
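
The watermark rule above can be sketched in plain Python. This is an illustration only, not the code in this MR: the `Row` class, the `apply_update` helper, and the `(wiki_db, revision_id)` key are hypothetical, and we assume that an update carrying the same watermark as the existing row is still applied.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Tuple


@dataclass(frozen=True)
class Row:
    """Hypothetical, trimmed-down stand-in for a wikitext_raw_rc1 row."""
    wiki_db: str
    revision_id: int
    content: str
    row_last_updated: datetime  # the watermark


# Assumption: rows are identified by (wiki_db, revision_id).
Key = Tuple[str, int]


def apply_update(table: Dict[Key, Row], incoming: Row) -> bool:
    """Apply `incoming` unless the table already holds a higher watermark.

    Returns True if the row was written, False if the update was ignored.
    """
    key = (incoming.wiki_db, incoming.revision_id)
    existing = table.get(key)
    # Ignore the update when we already have more recent data.
    if existing is not None and existing.row_last_updated > incoming.row_last_updated:
        return False
    table[key] = incoming
    return True
```

With this rule, a backfill stamped with an older snapshot date cannot overwrite a row that a more recent stream event has already touched, regardless of the order in which the two jobs run.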

Additionally, we add a new MERGE INTO pyspark script that can backfill at a monthly granularity given the scalability issues described at https://phabricator.wikimedia.org/T340861
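
As a rough sketch of what a monthly backfill could look like (not the script added in this MR), the helpers below split a date range into month windows and build one MERGE INTO statement per window. The function names, the `backfill_source` view, and the `revision_id` join key are assumptions made for illustration; the source view is assumed to already expose the target's columns, including row_last_updated.

```python
from datetime import date
from typing import Iterator, List, Tuple


def month_windows(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield [month_start, next_month_start) windows from start's month up to end (exclusive)."""
    current = date(start.year, start.month, 1)
    while current < end:
        if current.month == 12:
            nxt = date(current.year + 1, 1, 1)
        else:
            nxt = date(current.year, current.month + 1, 1)
        yield current, nxt
        current = nxt


def build_merge_sql(target: str, source: str, wiki_dbs: List[str],
                    start: date, end: date) -> str:
    """Build a MERGE INTO statement restricted to some wikis and one month window."""
    wikis = ", ".join(f"'{w}'" for w in wiki_dbs)
    lo, hi = start.isoformat(), end.isoformat()
    return f"""
MERGE INTO {target} t
USING (
  SELECT * FROM {source}
  WHERE wiki_db IN ({wikis})
    AND revision_timestamp >= TIMESTAMP '{lo}'
    AND revision_timestamp <  TIMESTAMP '{hi}'
) s
ON  t.wiki_db = s.wiki_db
AND t.revision_id = s.revision_id
-- Target-side predicates so only the matching partitions are considered.
AND t.wiki_db IN ({wikis})
AND t.revision_timestamp >= TIMESTAMP '{lo}'
AND t.revision_timestamp <  TIMESTAMP '{hi}'
-- Watermark check: skip rows where the target already has newer data.
WHEN MATCHED AND s.row_last_updated >= t.row_last_updated THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
```

Each generated statement would then be passed to `spark.sql(...)` in turn, one wiki_db grouping and one month at a time, so that each merge only has to consider a bounded set of (wiki_db, days(revision_timestamp)) partitions.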

Bug: T340861
Bug: T336714

Edited by Xcollazo
