Don't process events with a large content body.
Events that have a content body above a certain
threshold should not be processed. For these events,
a request will not be issued and a ContentBodyTooLargeError
exception will be raised instead.
This should mitigate rare cases where the resulting event might exceed the max record size allowed by Kafka.
This is an additional defense layer atop https://phabricator.wikimedia.org/T344688, which accounts for all large payloads we observed so far.
The max allowed rev_size
defaults to 8_000_000 bytes,
while the expected max message size in page_content_change
is 10MB. This should leave enough buffer for message metadata
plus the originating event.
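The guard itself boils down to something like the following sketch; the helper name and the way rev_size is passed in are illustrative, not the actual implementation:

```python
# Illustrative sketch only: MAX_REV_SIZE and check_rev_size are assumed
# names, not the actual module layout.
MAX_REV_SIZE = 8_000_000  # bytes, default max allowed rev_size


class ContentBodyTooLargeError(Exception):
    """Raised instead of issuing a content request for an oversized revision."""


def check_rev_size(rev_size: int, max_rev_size: int = MAX_REV_SIZE) -> None:
    """Raise ContentBodyTooLargeError when rev_size exceeds the threshold."""
    if rev_size > max_rev_size:
        raise ContentBodyTooLargeError(
            f"rev_size {rev_size} exceeds the max allowed {max_rev_size} bytes"
        )
```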
Getting the exact size of a nested dict (in bytes) is not straightforward: it requires traversing the dict and getting the size of each referenced object. We also have no guarantee that a Python object's size will match its JVM counterpart (most likely it won't), nor the size of the actual JSON string produced in Kafka.
Alternatively, enriching the event and converting it to a utf-8
encoded (JSON) string
would be more precise but expensive. Since problematic events are a rare occurrence (we saw one
in 150+MM events so far), I'd rather avoid additional SerDe overhead and
use a heuristic instead.
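For reference, the precise but expensive check would look roughly like this sketch (a full serialization pass per event, which is the SerDe overhead the heuristic avoids):

```python
import json


def exact_serialized_size(enriched_event: dict) -> int:
    """Exact byte size of the JSON string that would be produced to Kafka.

    Accurate, but adds a serialization pass for every event.
    """
    return len(json.dumps(enriched_event).encode("utf-8"))
```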
I did consider moving this logic to eventutilities-python,
but it's not obvious where a good place would be.
We would need to inspect message size before committing to Kafka, which means:
- we could check size in the EventProcessFunction output, but it's not obvious that we should always drop the message there (Flink operators downstream could still consume large messages).
- we could check before accessing the sink, or when serializing Python -> JVM, but then we would not be able to forward the message to a side output.
- we could check in Java eventutilities, but this might require an ad-hoc FlinkProducer, which seems overkill given the number of current use cases.
Related work
This MR contains a fix for T345147
Bug: T342399
Bug: T345147