repos / data-engineering / MediaWiki Stream Enrichment / Merge requests / !7

Merge multiple streams and produce enriched page change content.

Merged. Gmodena requested to merge T307959-enrich-event-payload into main on Jun 03, 2022.

Bug: T307959

This MR is a draft/WIP and not ready to be merged yet.

Pass the page payload to AsyncDataStream, and enrich it in the AsyncFunction. AsyncDataStream will produce a message with a revision schema plus content and action fields.

Changes

The main changes are:

  1. Support multiple types of page changes (create, delete, edit).
  2. Improve error handling and logging.
  3. Decouple enrichment logic from async behaviour.
  4. Add integration tests for local e2e execution.
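
To illustrate change 3, here is a minimal sketch of what decoupling the enrichment logic from the async call could look like. It uses only the Scala standard library rather than Flink, and every name in it (`PageChange`, `EnrichedPageChange`, `EnrichmentSketch`, `enrichAsync`, the `fetch` parameter) is hypothetical and not taken from this MR. The assumption that a delete carries no content is also illustrative.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical model of the three page change types (create, delete, edit).
sealed trait Action
case object Create extends Action
case object Delete extends Action
case object Edit extends Action

// Hypothetical input and output records; field names are illustrative only.
final case class PageChange(pageId: Long, revId: Long, action: Action)
final case class EnrichedPageChange(
    pageId: Long,
    revId: Long,
    action: Action,
    content: Option[String]
)

object EnrichmentSketch {
  // Pure enrichment logic: no I/O, so it can be unit tested on its own.
  def enrich(change: PageChange, content: Option[String]): EnrichedPageChange =
    change.action match {
      // Assumption: a deleted page has no content to attach.
      case Delete => EnrichedPageChange(change.pageId, change.revId, Delete, None)
      case other  => EnrichedPageChange(change.pageId, change.revId, other, content)
    }

  // Async wrapper: the fetch function is injected, so tests can pass a stub
  // instead of a real HTTP client.
  def enrichAsync(change: PageChange, fetch: Long => Future[String])(
      implicit ec: ExecutionContext
  ): Future[EnrichedPageChange] =
    change.action match {
      case Delete => Future.successful(enrich(change, None))
      case _      => fetch(change.revId).map(c => enrich(change, Some(c)))
    }
}

object EnrichmentSketchDemo {
  def main(args: Array[String]): Unit = {
    import ExecutionContext.Implicits.global
    // Stub standing in for the network call.
    val stubFetch: Long => Future[String] =
      revId => Future.successful(s"wikitext for revision $revId")
    val result =
      Await.result(EnrichmentSketch.enrichAsync(PageChange(1L, 42L, Edit), stubFetch), 5.seconds)
    println(result)
  }
}
```

Keeping `enrich` free of I/O means the mapping from change type to output can be tested without a running job, while the async wrapper only decides whether a fetch is needed.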

A good entry point for this MR is the test at src/test/scala/EnrichmentSuite.scala. Enrichment.makePipeline contains most of the boilerplate for setting up the DAG topology. src/main/scala/org/wikimedia/dataplatform/AsyncEnrichWithContent.scala contains the Flink AsyncFunction that calls out to the Action API to perform enrichment.

TODOs

Follow-up work

  1. Add retry on error for network calls.
  2. Implement side output for managing error reporting.
  3. Revisit naming conventions (especially across modules).
  4. Add testing boilerplate that accounts for schema validation and JSON Schema resources.
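
As a rough illustration of follow-up item 1, retry on error for a network call could be sketched as below. This is not part of the MR: `RetrySketch` and `retry` are hypothetical names, the sketch uses plain Scala futures rather than any Flink facility, and it applies no backoff between attempts.

```scala
import scala.concurrent.{ExecutionContext, Future}

object RetrySketch {
  // Run `op` up to `maxAttempts` times in total. Returns the first success,
  // or the failure from the last attempt. Only failures captured inside the
  // Future are retried; an exception thrown while creating it propagates.
  def retry[T](maxAttempts: Int)(op: () => Future[T])(
      implicit ec: ExecutionContext
  ): Future[T] =
    op().recoverWith {
      case _ if maxAttempts > 1 => retry(maxAttempts - 1)(op)
    }
}
```

A production version would likely limit retries to transient errors (for example timeouts) and wait between attempts.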

Reviewers / informed

@tchin @otto @dcausse @joal @lbowmaker

Edited Jun 13, 2022 by Gmodena
Source branch: T307959-enrich-event-payload