Add DAG to backfill wmf_dumps.wikitext_raw.

Xcollazo requested to merge add-mediawiki-dumps-backfill into main

(Depends on repos/data-engineering/dumps/mediawiki-content-dump!10 (merged))

In this MR we:

  • Introduce a group of DAGs that will backfill wmf_dumps.wikitext_raw (currently wikitext_raw_rc1). We split the wikis into 4 groups; the sizes noted below are relative to enwiki's revision count (see the filter sketch after this list):
groups = [
    {"name": "enwiki", "exclusive": False, "members": ["enwiki"]},  # (size is 100% of revisions of enwiki)
    {"name": "wikidatawiki", "exclusive": False, "members": ["wikidatawiki"]},  # (size is 71%  of revisions of enwiki)
    {"name": "commonswiki", "exclusive": False, "members": ["commonswiki"]},  # (size is 42%  of revisions of enwiki)
    {
        "name": "all_other_wikis",
        "exclusive": True,  # exclusive=True: every wiki *except* the listed members
        "members": ["enwiki", "wikidatawiki", "commonswiki"],
    },  # (size is 97%  of revisions of enwiki)
]
  • For reasons discussed in https://phabricator.wikimedia.org/T340861#9114717, we currently backfill one year per group at a time. We started with a monthly cadence, but found an executor configuration that allows yearly runs (thanks @joal!). This means that, for each group above, we generate 23 Spark tasks, one per year from 2001 to 2023 (see the DAG sketch after this list). To be conscious of resources, we run at most one task per group at a time. Each Spark job currently uses ~22.5% of cluster resources (106 containers, 211 cores, 2.5TB of memory).
  • These jobs run Spark 3.3.2, thus the configuration for for_virtual_env() is a bit wonky. I intend to open separate tickets to make it easier to run custom Spark versions on our cluster.
  • Because Spark 3.1's Shuffle Service is incompatible with 3.3's, we run these jobs with fixed resources, i.e. without dynamic allocation (see the configuration sketch after this list). This can be reverted once we get https://phabricator.wikimedia.org/T344910.
  • Finally, we also updated the events streaming job to use Spark 3.3.2.
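
As a reading aid, here is a minimal sketch (not the actual job code) of how the exclusive flag can be turned into a revision filter. The filter_for_group helper and the wiki_db column name are assumptions for illustration:

from pyspark.sql import DataFrame, functions as F

def filter_for_group(revisions: DataFrame, group: dict) -> DataFrame:
    """Keep only the group's wikis, or everything *except* them when exclusive=True."""
    # wiki_db as the wiki identifier column is an assumption for this sketch
    in_members = F.col("wiki_db").isin(group["members"])
    return revisions.where(~in_members if group["exclusive"] else in_members)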
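
The per-year, per-group task layout could look roughly like the following Airflow sketch. This is illustrative only: the operator choice, dag_id naming, CLI flags, and the spark3-submit wrapper are assumptions, not the actual airflow-dags code.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Plus the other group definitions shown above.
groups = [{"name": "enwiki", "exclusive": False, "members": ["enwiki"]}]

for group in groups:
    dag = DAG(
        dag_id=f"backfill_wikitext_raw_{group['name']}",
        start_date=datetime(2023, 9, 1),
        schedule=None,        # backfill DAGs are triggered manually (Airflow 2.4+ style)
        max_active_tasks=1,   # at most one yearly Spark job per group at a time
        catchup=False,
    )
    with dag:
        for year in range(2001, 2024):  # 23 yearly Spark tasks per group
            BashOperator(
                task_id=f"backfill_{group['name']}_{year}",
                bash_command=(
                    "spark3-submit ... "  # submit wrapper and Spark conf elided
                    f"--year {year} --wikis {','.join(group['members'])} "
                    f"--exclude-listed-wikis {group['exclusive']}"  # hypothetical flags
                ),
            )
    globals()[dag.dag_id] = dag  # expose each dynamically generated DAG to the scheduler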
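
And a sketch of what running with fixed resources (dynamic allocation off) could look like. The specific values below are illustrative only, not this MR's actual settings:

# With no compatible shuffle service, dynamic allocation is off and executors
# are pinned for the lifetime of the job.
fixed_resources_conf = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.shuffle.service.enabled": "false",
    "spark.executor.instances": "105",  # + 1 driver container ≈ 106 containers
    "spark.executor.cores": "2",        # 105 * 2 + 1 driver core = 211 cores
    "spark.executor.memory": "20g",
    "spark.executor.memoryOverhead": "4g",
    "spark.driver.memory": "16g",
}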

Bug: T340861

Bug: T344709
