Job to submit reconcile events to Kafka via wmf-event-stream

Xcollazo requested to merge emit-via-wmf-event-stream into main

In this MR we implement a job to submit reconcile events to Kafka via wmf-event-stream.

The job works as follows:

  • First, we read wikitext_inconsistent_rows for a particular compute_dt and categorize each row as one of: inconsistent revision, missing page delete, or missing page move (see the first sketch after this list).
  • Based on the categorization, we proceed to collect affected revision_ids or page_ids.
    • For inconsistent revisions, we hit the Analytics Replica and fetch the latest metadata available for that revision.
    • For missing page moves, we figure out the latest revision of that page, then hit the Analytics Replica and fetch that revision's metadata, which includes the page metadata.
    • For missing page deletes, we do not need to hit the Replica since we already have the page_id in hand.
  • All of the above comes from MariaDB on a per-content-slot basis, so we then collapse it via COLLECT_LIST() and convert it into page_content_change events that mimic page_change_kind = edit | move | delete (see the second sketch after this list).
  • Then these events are emitted via the new wmf-event-stream Spark Data Source sink.
  • Finally, we mutate wikitext_inconsistent_rows to mark rows as 'processed' via the reconcile_emit_dt column. A later run for the same compute_dt will, by default, ignore rows WHERE reconcile_emit_dt IS NOT NULL, making the pipeline safe to rerun when only some wikis failed. We can still --force the pipeline to reconcile.
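
To make the categorization step concrete, here is a rough, hedged sketch assuming Spark SQL over wikitext_inconsistent_rows. Only the table name and the compute_dt / reconcile_emit_dt columns come from this MR; the wiki_db / page_id / revision_id columns, the inconsistency column, and its values are hypothetical stand-ins for however the rows are actually labeled.

```python
# Sketch only: the `inconsistency` column and its values are hypothetical
# placeholders for the real labeling of wikitext_inconsistent_rows.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reconcile-categorize-sketch").getOrCreate()

compute_dt = "2024-06-01"  # example partition being reconciled

categorized = spark.sql(f"""
    SELECT
        wiki_db,
        page_id,
        revision_id,
        CASE inconsistency
            WHEN 'INCONSISTENT_REVISION' THEN 'edit'    -- re-fetch this revision's metadata from the Replica
            WHEN 'MISSING_PAGE_MOVE'     THEN 'move'    -- resolve the page's latest revision, then fetch its metadata
            WHEN 'MISSING_PAGE_DELETE'   THEN 'delete'  -- page_id alone is enough, no Replica call needed
        END AS page_change_kind
    FROM wikitext_inconsistent_rows
    WHERE compute_dt = '{compute_dt}'
      AND reconcile_emit_dt IS NULL   -- default behavior: skip rows already reconciled
""")
```

The per-category Analytics Replica lookups (fetching revision and page metadata from MariaDB) are elided above; the second sketch below picks up once those per-content-slot rows are in hand.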
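And a similarly hedged sketch of the collapse, emit, and mark-processed steps. The per_slot_revision_metadata view, the event field names, the "wmf-event-stream" format string and its option key, and the assumption that wikitext_inconsistent_rows supports row-level UPDATE (e.g. as an Iceberg table) are all illustrative; only the COLLECT_LIST() collapse, the page_content_change shape, and the reconcile_emit_dt marking come from the MR description.

```python
# Sketch only: format name, options, and field names are placeholders for
# whatever the wmf-event-stream sink and event schema actually expect.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reconcile-emit-sketch").getOrCreate()

compute_dt = "2024-06-01"  # same example partition as above

# Collapse the per-content-slot rows coming back from MariaDB into a single
# array per revision, so each output row maps to one page_content_change event.
events = spark.sql("""
    SELECT
        wiki_db,
        page_id,
        revision_id,
        page_change_kind,                       -- 'edit' | 'move' | 'delete'
        COLLECT_LIST(
            NAMED_STRUCT('slot_role', slot_role, 'content_body', content_body)
        ) AS content_slots
    FROM per_slot_revision_metadata             -- hypothetical intermediate view
    GROUP BY wiki_db, page_id, revision_id, page_change_kind
""")

# Emit via the new wmf-event-stream Spark Data Source sink (format name and
# option key below are placeholders).
(events.write
    .format("wmf-event-stream")
    .option("stream", "mediawiki.page_content_change")
    .save())

# Mark the source rows as processed so a rerun for the same compute_dt skips
# them by default (unless --force is passed). Assumes the table format
# supports row-level UPDATE, e.g. Iceberg.
spark.sql(f"""
    UPDATE wikitext_inconsistent_rows
    SET reconcile_emit_dt = current_timestamp()
    WHERE compute_dt = '{compute_dt}'
      AND reconcile_emit_dt IS NULL
""")
```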

We were able to do pretty much all of it in SQL, which I think makes the code more readable than if it were PySpark, but alas, it is still a complex transformation.

Bug: T368755

Edited by Xcollazo
