Job to submit reconcile events to Kafka via wmf-event-stream

Xcollazo requested to merge emit-via-wmf-event-stream into main

In this MR we implement a job to submit reconcile events to Kafka via wmf-event-stream.

The job works as follows:

  • First, we read wikitext_inconsistent_rows for a particular compute_dt and categorize each row as one of: inconsistent revision, missing page delete, or missing page move (see the first sketch after this list).
  • Based on the categorization, we proceed to collect affected revision_ids or page_ids.
    • For inconsistent revisions, we hit the Analytics Replica and fetch the latest metadata available for that revision.
    • For missing page moves, we figure out the latest revision of that page, then hit the Analytics Replica and fetch that revision's metadata, which includes the page metadata.
    • For missing page deletes, we do not need to hit the Replica since we already have the page_id in hand.
  • All of the above comes from MariaDB on a per-content-slot basis, so we then collapse it via COLLECT_LIST() and convert it into page_content_change events that mimic page_change_kind = edit | move | delete (see the second sketch after this list).
  • Then these events are emitted via the new wmf-event-stream Spark Data Source sink.
  • Finally, we mutate wikitext_inconsistent_rows to mark rows as 'processed' via the reconcile_emit_dt column. A later run for the same compute_dt will, by default, ignore rows WHERE reconcile_emit_dt IS NOT NULL, making the pipeline safe to rerun when only some wikis failed. We can still --force the pipeline to reconcile.
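
To make the categorization step concrete, here is a rough, hedged sketch assuming Spark SQL over wikitext_inconsistent_rows. Only the table name and the compute_dt / reconcile_emit_dt columns come from this MR; the wiki_db / page_id / revision_id columns, the inconsistency column, and its values are hypothetical stand-ins for however the rows are actually labeled.

```python
# Sketch only: the `inconsistency` column and its values are hypothetical
# placeholders for the real labeling of wikitext_inconsistent_rows.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reconcile-categorize-sketch").getOrCreate()

compute_dt = "2024-06-01"  # example partition being reconciled

categorized = spark.sql(f"""
    SELECT
        wiki_db,
        page_id,
        revision_id,
        CASE inconsistency
            WHEN 'INCONSISTENT_REVISION' THEN 'edit'    -- re-fetch this revision's metadata from the Replica
            WHEN 'MISSING_PAGE_MOVE'     THEN 'move'    -- resolve the page's latest revision, then fetch its metadata
            WHEN 'MISSING_PAGE_DELETE'   THEN 'delete'  -- page_id alone is enough, no Replica call needed
        END AS page_change_kind
    FROM wikitext_inconsistent_rows
    WHERE compute_dt = '{compute_dt}'
      AND reconcile_emit_dt IS NULL   -- default behavior: skip rows already reconciled
""")
```

The per-category Analytics Replica lookups (fetching revision and page metadata from MariaDB) are elided above; the second sketch below picks up once those per-content-slot rows are in hand.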
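And a similarly hedged sketch of the collapse, emit, and mark-processed steps. The per_slot_revision_metadata view, the event field names, the "wmf-event-stream" format string and its option key, and the assumption that wikitext_inconsistent_rows supports row-level UPDATE (e.g. as an Iceberg table) are all illustrative; only the COLLECT_LIST() collapse, the page_content_change shape, and the reconcile_emit_dt marking come from the MR description.

```python
# Sketch only: format name, options, and field names are placeholders for
# whatever the wmf-event-stream sink and event schema actually expect.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reconcile-emit-sketch").getOrCreate()

compute_dt = "2024-06-01"  # same example partition as above

# Collapse the per-content-slot rows coming back from MariaDB into a single
# array per revision, so each output row maps to one page_content_change event.
events = spark.sql("""
    SELECT
        wiki_db,
        page_id,
        revision_id,
        page_change_kind,                       -- 'edit' | 'move' | 'delete'
        COLLECT_LIST(
            NAMED_STRUCT('slot_role', slot_role, 'content_body', content_body)
        ) AS content_slots
    FROM per_slot_revision_metadata             -- hypothetical intermediate view
    GROUP BY wiki_db, page_id, revision_id, page_change_kind
""")

# Emit via the new wmf-event-stream Spark Data Source sink (format name and
# option key below are placeholders).
(events.write
    .format("wmf-event-stream")
    .option("stream", "mediawiki.page_content_change")
    .save())

# Mark the source rows as processed so a rerun for the same compute_dt skips
# them by default (unless --force is passed). Assumes the table format
# supports row-level UPDATE, e.g. Iceberg.
spark.sql(f"""
    UPDATE wikitext_inconsistent_rows
    SET reconcile_emit_dt = current_timestamp()
    WHERE compute_dt = '{compute_dt}'
      AND reconcile_emit_dt IS NULL
""")
```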

We were able to do pretty much all of it in SQL, which I think makes the code more readable than if it were PySpark, but alas, it is still a complex transformation.

Bug: T368755

Edited by Xcollazo
