Add job to publish content dumps as XML

Xcollazo requested to merge publish-dumps-to-xml-take-2 into main

(This MR depends on https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/991795/ being released as part of refinery-job-0.2.29-shaded.jar. This has now been merged.)

In this MR we implement a first cut of the Airflow DAG that will convert the intermediate table wmf_dumps.wikitext_raw into actual XML dumps.

The DAG itself is incomplete, as we do not have a proper sensor yet. Additionally, we are only dumping simplewiki right now.

Still, we'd like to start exercising these code paths on a regular basis, so we want to get this MR into prod.

Bug: T346278

Edited by Xcollazo
