Adds a streaming job that listens for page_create events, queries the Action API for page content changes, and prints them to stdout.
As an entry point for code review, these are the key points of this job:
- It uses the DataStream API to consume page_create messages. Event schema boilerplate is defined in `org.wikimedia.dataplatform.events.PageCreate`.
- It performs calls to the Action API in an `AsyncDataStream`. The related Flink AsyncFunction and HTTP client are implemented in `wikimedia.dataplatform.AsyncActionRequest`.
- Logic to parse Action API responses is implemented in `org.wikimedia.dataplatform.Content` (bad class name, will be renamed). The output should be adjusted to meet https://phabricator.wikimedia.org/T308017.
- Some Kafka boilerplate classes and a query builder helper are defined in the `dataplatform` package object.
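For reviewers unfamiliar with the Action API, here is a minimal sketch of what a query builder helper of this kind might look like. It is not the code in this MR: `ActionApiQuery`, `revisionContentUrl`, and the chosen parameters (`prop=revisions`, `rvprop=content`, `rvslots=main`) are assumptions for illustration, and the helper in the package object may build a different query.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical helper, for illustration only; the real query builder
// lives in the dataplatform package object and may differ.
object ActionApiQuery {
  // Percent-encode a single query-string component.
  private def encode(s: String): String =
    URLEncoder.encode(s, StandardCharsets.UTF_8.name())

  // Build an Action API URL asking for the main-slot content of one revision.
  def revisionContentUrl(domain: String, revId: Long): String = {
    val params = Seq(
      "action"  -> "query",
      "format"  -> "json",
      "prop"    -> "revisions",
      "revids"  -> revId.toString,
      "rvprop"  -> "content",
      "rvslots" -> "main"
    )
    val query = params
      .map { case (k, v) => s"${encode(k)}=${encode(v)}" }
      .mkString("&")
    s"https://$domain/w/api.php?$query"
  }
}
```

Calling `ActionApiQuery.revisionContentUrl("en.wikipedia.org", 12345L)` yields a URL whose query string starts with `action=query&format=json`. Keeping the parameters in an ordered `Seq` makes the output deterministic, which helps when matching requests in tests.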
This MR is part of https://phabricator.wikimedia.org/T307959.
Demo
A demo of this Flink job is available at: https://yarn.wikimedia.org/proxy/application_1651744501826_85321/#/task-manager/container_e38_1651744501826_85321_01_000010/stdout
Known issues
If the AsyncFunction times out, the interactive process (=the thing running on state nodes) will stop, but the job and tasks (=yarn container) will continue execution.
Todo
These should be addressed in upcoming MRs:
- [ ] `wikimedia.dataplatform.AsyncActionRequest` should implement timeout / retry logic.
- [ ] Update source and sinks to the new API. (cc / @tchin )
- [ ] Fix netty deps version mismatch
- [ ] Implement SideOutput early on
- [ ] AsyncFunction should return the response status, to allow error handling downstream
- [ ] Improve unit tests. What we have is the result of TDD.
  - WIP in the `T307959-enrich-event-payload` branch.
- [ ] Spike: what is the best way to mock Action API responses in integration tests? (cc / @tchin )
  - Mocks in `T307959-enrich-event-payload` are implemented with WireMock
- [ ] Fix scalastyle warnings.
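To make the retry and response-status items above more concrete, here is a rough sketch of one possible shape, using only the Scala standard library. `ActionRetry`, `ActionResponse`, and `withRetry` are hypothetical names, not part of this MR. The sketch assumes that 5xx statuses and non-fatal exceptions are retryable, and it leaves out timeouts and backoff entirely.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

// Hypothetical sketch; names and retry policy are assumptions.
object ActionRetry {
  // The HTTP status travels with the body so downstream operators can inspect it.
  final case class ActionResponse(status: Int, body: String)

  type Result = Either[String, ActionResponse]

  // Retry on 5xx statuses and non-fatal exceptions, for at most maxAttempts calls.
  // When attempts run out, the Future completes with Left(reason) instead of
  // failing, so errors can be routed downstream (for example to a side output).
  // Non-5xx responses, including 4xx, are returned as Right for the caller to judge.
  def withRetry(maxAttempts: Int)(call: () => Future[ActionResponse])(
      implicit ec: ExecutionContext): Future[Result] = {
    def attempt(n: Int): Future[Result] = {
      val outcome: Future[Result] =
        Future.unit
          .flatMap(_ => call()) // also captures exceptions thrown synchronously by call()
          .map[Result](r => if (r.status >= 500) Left(s"HTTP ${r.status}") else Right(r))
          .recover { case NonFatal(e) => Left(e.toString) }

      outcome.flatMap {
        case Left(_) if n < maxAttempts => attempt(n + 1)
        case result                     => Future.successful(result)
      }
    }
    attempt(1)
  }
}
```

Returning an `Either` rather than a failed `Future` is a design choice in this sketch: it keeps a single bad response from failing the whole job and leaves the decision about what counts as an error to the downstream operator.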
Related work
Work in progress to refactor enrichment logic is happening at:
The idea is to pass the page payload to AsyncDataStream, and enrich in the AsyncFunction. AsyncDataStream will produce a message with a revision schema plus content and action fields.
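As a rough illustration of that idea, the enrichment step could be modelled as below. All names and fields here (`PageCreate`, `EnrichedRevision`, `enrich`, and the `"create"` action value) are simplified assumptions for discussion, not the actual event schemas.

```scala
// Hypothetical sketch; the real schemas carry many more fields.
object Enrichment {
  // Trimmed-down stand-in for the page_create payload.
  final case class PageCreate(pageId: Long, revId: Long, domain: String)

  // Revision-like fields plus the added action and content fields.
  final case class EnrichedRevision(
      pageId: Long,
      revId: Long,
      domain: String,
      action: String,
      content: Option[String])

  // Combine the incoming event with the content fetched from the Action API.
  // content is None when the lookup returned nothing.
  def enrich(event: PageCreate, content: Option[String]): EnrichedRevision =
    EnrichedRevision(event.pageId, event.revId, event.domain, action = "create", content = content)
}
```

In the actual job this function body would run inside the AsyncFunction, once the Action API response has been parsed.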
Tagging @otto @joal @dcausse @tchin @fab as (optional) reviewers. Tagging @lbowmaker as informed.