Job to submit reconcile events to Kafka via wmf-event-stream
In this MR we implement a job to submit reconcile events to Kafka via wmf-event-stream.
The job works as follows:
- First, we read `wikitext_inconsistent_rows` for a particular `compute_dt` and categorize the rows as follows: inconsistent revision, missing page delete, missing page move (see the categorization sketch after this list).
- Based on the categorization, we proceed to collect the affected `revision_id`s or `page_id`s.
- For inconsistent revisions, we hit the Analytics Replica and fetch the latest metadata available for that revision.
- For missing page moves, we figure out the latest revision of that page, then hit the Analytics Replica and fetch the latest revision metadata, which includes the page metadata.
- For missing page deletes we do not need to hit the Replica, since we already have the `page_id` in hand.
- All of the above comes from MariaDB on a per content slot basis, so we then collapse it via `COLLECT_LIST()` and convert it into `page_content_change` events that mimic being `page_change_kind = edit | move | delete` (see the collapse sketch below).
- These events are then emitted via the new `wmf-event-stream` Spark Data Source sink (see the write sketch below).
- Finally, we mutate `wikitext_inconsistent_rows` to mark rows as 'processed' via the `reconcile_emit_dt` column. A later run for the same `compute_dt` will by default ignore rows `WHERE reconcile_emit_dt IS NOT NULL`, thus making the pipeline safe to rerun when only some wikis failed. We can still `--force` the pipeline to reconcile (see the bookkeeping sketch below).
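To make the read/categorize step concrete, here is a minimal sketch (not the job's actual code). It assumes a running `spark` session; the categorization flags (`page_was_deleted`, `page_was_moved`) and the selected columns are invented for illustration, since the real `wikitext_inconsistent_rows` schema is richer.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

compute_dt = "2024-07-01"  # example value; the real one comes from the job arguments

# Read the inconsistent rows for one compute_dt and bucket them into the three
# reconcile kinds. The CASE flags below are hypothetical stand-ins for the real
# categorization logic.
categorized = spark.sql(f"""
    SELECT
        wiki_db,
        page_id,
        revision_id,
        CASE
            WHEN page_was_deleted THEN 'missing_page_delete'
            WHEN page_was_moved   THEN 'missing_page_move'
            ELSE                       'inconsistent_revision'
        END AS reconcile_kind
    FROM wikitext_inconsistent_rows
    WHERE compute_dt = '{compute_dt}'
      AND reconcile_emit_dt IS NULL  -- skip rows already reconciled (unless --force)
""")
```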
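The per-content-slot collapse can be pictured roughly like this. `replica_revision_slots` stands in for whatever intermediate view holds the rows fetched from the Analytics Replica, and the struct fields are a simplified subset of the real `page_content_change` schema.

```python
# Hedged sketch: MariaDB gives us one row per (revision, content slot), so we
# group per revision and fold the slots together with COLLECT_LIST(), shaping
# the result to look like a page_content_change event. Names are illustrative.
collapsed = spark.sql("""
    SELECT
        wiki_db,
        page_id,
        revision_id,
        CASE reconcile_kind
            WHEN 'missing_page_move'   THEN 'move'
            WHEN 'missing_page_delete' THEN 'delete'
            ELSE                            'edit'
        END AS page_change_kind,
        COLLECT_LIST(
            NAMED_STRUCT('slot_role', slot_role, 'content_body', content_body)
        ) AS content_slots
    FROM replica_revision_slots
    GROUP BY wiki_db, page_id, revision_id, reconcile_kind
""")
```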
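The emit step is then just a DataFrame write against the new sink. The data source short name, option keys, and stream/broker values below are placeholders, not the actual API of the `wmf-event-stream` Data Source introduced in this MR.

```python
# Hedged sketch of the emit step: standard DataFrameWriter API, with the format
# name and options shown here as placeholders for whatever the new
# wmf-event-stream sink actually expects.
(
    collapsed.write
    .format("wmf-event-stream")                           # assumed short name of the sink
    .option("stream", "example_reconcile_stream")         # placeholder stream name
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder brokers
    .save()
)
```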
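Finally, the bookkeeping step, assuming the table format supports SQL `UPDATE` (e.g. Iceberg):

```python
# Stamp the rows we just emitted so a rerun of the same compute_dt skips them
# by default; --force would bypass the reconcile_emit_dt IS NULL filter above.
spark.sql(f"""
    UPDATE wikitext_inconsistent_rows
    SET reconcile_emit_dt = current_timestamp()
    WHERE compute_dt = '{compute_dt}'
      AND reconcile_emit_dt IS NULL
""")
```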
We were able to do pretty much all of it in SQL, which I think makes the code more readable than if it were PySpark, but alas, it is still a complex transformation.
Bug: T368755