Implements a job that transforms event data into a dump-friendly format.
In this MR we add a DAG that transforms event data from event.mediawiki_page_content_change [1] into a dump-friendly format.
The associated PySpark job, which mainly runs a MERGE INTO, can currently be found at https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/main/mediawiki_content_dump/merge_into.py
The sink table is wmf_dumps.wikitext_raw_rc0. It is marked as rc0 to convey that it is not yet final. We intend to change the schema and the partitioning strategy as we learn more down the line.
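For readers unfamiliar with the MERGE INTO step mentioned above, here is a minimal sketch of how such a statement could be assembled in Python. It is not the code in merge_into.py: the helper name build_merge_sql and the column names (wiki_id, revision_id, revision_text) are hypothetical and chosen only for illustration, and the real job may select, rename, or filter columns differently. The table names are the ones given in this description.

```python
def build_merge_sql(target, source, key_columns, update_columns):
    """Build a Spark SQL MERGE INTO statement as a string.

    Hypothetical helper for illustration only; not taken from merge_into.py.
    """
    # Rows are matched between source (s) and target (t) on the key columns.
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_columns)
    # Matched rows get their non-key columns overwritten from the source.
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in update_columns)
    # Unmatched rows are inserted with every listed column.
    all_columns = list(key_columns) + list(update_columns)
    insert_cols = ", ".join(all_columns)
    insert_vals = ", ".join(f"s.{c}" for c in all_columns)
    return (
        f"MERGE INTO {target} t\n"
        f"USING {source} s\n"
        f"ON {on_clause}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


merge_sql = build_merge_sql(
    target="wmf_dumps.wikitext_raw_rc0",
    source="event.rc1_mediawiki_page_content_change",
    key_columns=["wiki_id", "revision_id"],  # assumed keys, not the real schema
    update_columns=["revision_text"],        # assumed column, not the real schema
)
print(merge_sql)
# In a Spark session the statement would then be run with spark.sql(merge_sql),
# assuming the sink table's format supports MERGE INTO.
```

Building the statement as a plain string keeps the sketch runnable without a Spark cluster; the upsert semantics (update on match, insert otherwise) are the part that corresponds to the job described here.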
[1] This table is still under development, so we are in fact using the release candidate event.rc1_mediawiki_page_content_change.
Bug: T335860