Adds streaming job that listens for page_create events, queries the Action API for page content changes, and prints them to stdout.
As an entry point for code review, there following are key points of this job:
- it uses the DataStream API to consume messages for the page_create. Event schema boilerplate is defined in
- it performs calls to the Action API in an
AsyncDataStream. The related Flink
AsyncFunction, and http client, is implemented in
- logic to parse Action API responses is implemented in
org.wikimedia.dataplatform.Content(bad class name, will be renamed). The output should be adjusted to meet https://phabricator.wikimedia.org/T308017.
- Some Kafka boilerplate classes and a query builder helper are defined in the
This MR is part of https://phabricator.wikimedia.org/T307959.
A demo of this Flink job is available at: https://yarn.wikimedia.org/proxy/application_1651744501826_85321/#/task-manager/container_e38_1651744501826_85321_01_000010/stdout
If AscynFunction times out, the interactive process (=the thing running on state nodes) will stop, but the job and tasks (=yarn container) will continue execution.
These should be addressed in upcoming MRs
wikimedia.dataplatform.AsyncActionRequestshould implement timeout / retry logic.
-  Update source and sinks to the new API. (cc / @tchin )
-  Fix netty deps version missmatch
-  Implement SideOutput early on
-  AsyncFunction should return the response status, to allow error handling downstream
-  Improve unit tests. What we have is the result of TDD.
- WIP in the
Spike: what is the best way to mock Action API responses in integration tests? (cc / @tchin )
- Mocks in
T307959-enrich-event-payloadare implemented with WireMock
Fix scalastyle warnings.
Work in progress to refactor enrichment logic is happening at:
The idea is to pass the page payload to AsyncDataStream, and enrich in the AsyncFunction. AsyncDataStream will produce a message with a revision schema plus content and action fields.