Don't process events with a large content body.
Events that have a content body above a certain
threshold should not be processed. For these events,
a request will not be issued and a ContentBodyTooLargeError
exception will be raised instead.
This should mitigate rare cases where the resulting event might exceed the max record size allowed by Kafka.
This is an additional defense layer atop https://phabricator.wikimedia.org/T344688, which accounts for all large payloads we observed so far.
The max allowed rev_size
defaults to 8_000_000 bytes,
while the expected max message size in page_content_change
is 10MB. This should leave enough buffer for message metadata
plus the originating event.
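The guard itself boils down to something like the following sketch; the helper name and the way rev_size is passed in are illustrative, not the actual implementation:

```python
# Illustrative sketch only: MAX_REV_SIZE and check_rev_size are assumed
# names, not the actual module layout.
MAX_REV_SIZE = 8_000_000  # bytes, default max allowed rev_size


class ContentBodyTooLargeError(Exception):
    """Raised instead of issuing a content request for an oversized revision."""


def check_rev_size(rev_size: int, max_rev_size: int = MAX_REV_SIZE) -> None:
    """Raise ContentBodyTooLargeError when rev_size exceeds the threshold."""
    if rev_size > max_rev_size:
        raise ContentBodyTooLargeError(
            f"rev_size {rev_size} exceeds the max allowed {max_rev_size} bytes"
        )
```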
Getting the exact size of a nested dict (in bytes) is not straightforward: it requires traversing the dict and getting the size of each referenced object. We also have no guarantee that a Python object's size will match its JVM counterpart (most likely it won't), nor the size of the actual JSON string produced in Kafka.
Alternatively, enriching the event and converting it to a utf-8
encoded (JSON) string
would be more precise but expensive. Since problematic events are a rare occurrence (we saw one
in 150+MM events so far), I'd rather avoid additional SerDe overhead and
use a heuristic instead.
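For reference, the precise but expensive check would look roughly like this sketch (a full serialization pass per event, which is the SerDe overhead the heuristic avoids):

```python
import json


def exact_serialized_size(enriched_event: dict) -> int:
    """Exact byte size of the JSON string that would be produced to Kafka.

    Accurate, but adds a serialization pass for every event.
    """
    return len(json.dumps(enriched_event).encode("utf-8"))
```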
I did consider moving this logic to eventutilities-python,
but it's not obvious where a good place would be.
We would need to inspect message size before committing to Kafka, which means:
- we could check size in the EventProcessFunction output, but it's not obvious that we should always drop the message there (Flink operators downstream could still consume large messages).
- we could check before accessing the sink, or when serializing Python -> JVM, but then we would not be able to forward the message to a side output.
- we could check in Java eventutilities, but this might require an ad-hoc FlinkProducer, which seems overkill given the number of current use cases.
Related work
This MR contains a fix for T345147
Bug: T342399
Bug: T345147