
Implements a job that transforms event data into a dump-friendly format.

Xcollazo requested to merge T335860-job-to-merge-events-into-iceberg into main

In this MR we add a DAG that transforms event data from event.mediawiki_page_content_change[1] into a dump-friendly format.

The associated PySpark job, which mainly runs a MERGE INTO, can currently be found at https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/main/mediawiki_content_dump/merge_into.py

The sink table is wmf_dumps.wikitext_raw_rc0. It is marked as rc0 to convey that it is not yet final. We intend to change the schema and the partitioning strategy as we learn more.
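
For reviewers unfamiliar with the pattern, here is a minimal sketch of the kind of MERGE INTO statement such a job might issue against the sink table. It is not the code in merge_into.py: the helper `build_merge_sql`, the temp view name `page_content_change_batch`, and the columns `wiki_db`, `revision_id`, and `revision_text` are all hypothetical, chosen only for illustration.

```python
def build_merge_sql(target_table, source_view, key_columns, value_columns):
    """Build a Spark SQL MERGE INTO statement as a string.

    Rows are matched on key_columns; matched rows get their value_columns
    updated, unmatched rows are inserted.
    """
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_columns)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in value_columns)
    all_columns = list(key_columns) + list(value_columns)
    insert_cols = ", ".join(all_columns)
    insert_vals = ", ".join(f"s.{c}" for c in all_columns)
    return (
        f"MERGE INTO {target_table} t\n"
        f"USING {source_view} s\n"
        f"ON {on_clause}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


sql = build_merge_sql(
    target_table="wmf_dumps.wikitext_raw_rc0",
    source_view="page_content_change_batch",  # hypothetical temp view of new events
    key_columns=["wiki_db", "revision_id"],   # hypothetical merge keys
    value_columns=["revision_text"],          # hypothetical payload column
)
print(sql)
```

With a SparkSession configured for Iceberg, the resulting string could be passed to `spark.sql(sql)`. Building the statement separately keeps the merge keys easy to inspect and change while the schema is still in flux.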

[1] This table is still being developed, so we are in fact using the release candidate event.rc1_mediawiki_page_content_change.

Bug: T335860

Edited by Xcollazo
