Skip to content

Process parse create events

Gmodena requested to merge T307959-process-page-create into main

Adds streaming job that listens for page_create events, queries the Action API for page content changes, and prints them to stdout.

As an entry point for code review, there following are key points of this job:

  • it uses the DataStream API to consume messages for the page_create. Event schema boilerplate is defined in org.wikimedia.dataplatform.events.PageCreate.
  • it performs calls to the Action API in an AsyncDataStream. The related Flink AsyncFunction, and http client, is implemented in wikimedia.dataplatform.AsyncActionRequest.
  • logic to parse Action API responses is implemented in org.wikimedia.dataplatform.Content (bad class name, will be renamed). The output should be adjusted to meet https://phabricator.wikimedia.org/T308017.
  • Some Kafka boilerplate classes and a query builder helper are defined in the dataplatform package object.

This MR is part of https://phabricator.wikimedia.org/T307959.

Demo

A demo of this Flink job is available at: https://yarn.wikimedia.org/proxy/application_1651744501826_85321/#/task-manager/container_e38_1651744501826_85321_01_000010/stdout

Known issues

If AscynFunction times out, the interactive process (=the thing running on state nodes) will stop, but the job and tasks (=yarn container) will continue execution.

Todo

These should be addressed in upcoming MRs

  • [] wikimedia.dataplatform.AsyncActionRequest should implement timeout / retry logic.
  • [] Update source and sinks to the new API. (cc / @tchin )
  • [] Fix netty deps version missmatch
  • [] Implement SideOutput early on
  • [] AsyncFunction should return the response status, to allow error handling downstream
  • [] Improve unit tests. What we have is the result of TDD.
  • WIP in the T307959-enrich-event-payload branch.
  • Spike: what is the best way to mock Action API responses in integration tests? (cc / @tchin )
  • Mocks in T307959-enrich-event-payload are implemented with WireMock
  • Fix scalastyle warnings.

Related work

Work in progress to refactor enrichment logic is happening at:

at https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-stream-enrichment/-/compare/T307959-process-page-create...T307959-enrich-event-payload?from_project_id=332

The idea is to pass the page payload to AsyncDataStream, and enrich in the AsyncFunction. AsyncDataStream will produce a message with a revision schema plus content and action fields.

Tagging @otto @joal @dcausse @tchin @fab as (optional) reviewers. Tagging @lbowmaker as informed.

Edited by Gmodena

Merge request reports