Split lines on \n when generating the HTML diff
The previous implementation, str.splitlines(keepends=True), split on \n, \r\n, \r, U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR, etc. The patch library reads with readline(), which breaks on \n only, so the generated diff was inconsistent from the parser's point of view and parsing threw an error.
Observed 25 events in the error sink (~3 days' worth of data), all for this reason. Spot-checked a few of these events with this fix, and the diff-apply loop ran correctly on them.
Cause
- make_unified_diff used str.splitlines(keepends=True).
- splitlines() breaks on any Unicode line break: \n, \r\n, bare \r, U+2028 LINE SEPARATOR, U+2029, etc.
- difflib.unified_diff then treats each piece as a line and fills in @@ -a,b +c,d @@ so that the lengths b/d match that line count.
- Serialized diff rows are still one record per \n (what readline() sees in python-patch).
- If the only separator between two "lines" for splitlines is e.g. U+2028, they become two logical lines for difflib but one physical \n-terminated row in the file → fewer rows than the hunk header says → the next @@ appears "too early" → parser error ("invalid unified diff format").
So the diff was internally inconsistent: correct for difflib’s line model, wrong for LF-row-based parsers.
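
A minimal reproduction of the mismatch, using only the standard library. The sample strings and variable names are made up for illustration, and python-patch itself is not invoked; the LF-based reader is approximated with split("\n").

```python
import difflib

# Sample inputs (illustrative): the first row contains U+2028 LINE SEPARATOR.
old = "alpha\u2028beta\ngamma\n"
new = "alpha\u2028beta\nGAMMA\n"

# splitlines() also breaks on U+2028, so difflib sees three lines per side.
diff_lines = list(difflib.unified_diff(
    old.splitlines(keepends=True),
    new.splitlines(keepends=True),
    "a", "b",
))
diff_text = "".join(diff_lines)

# What a reader that only breaks on "\n" would see.
physical_rows = diff_text.split("\n")[:-1]

print(repr(diff_lines[2]))                 # '@@ -1,3 +1,3 @@\n'
print(len(diff_lines), len(physical_rows)) # 7 6
```

The hunk header announces 3 lines on each side, but an LF-based reader finds only 2 old-side rows and 2 new-side rows after it, because the context pieces "alpha" and "beta" were joined into one physical row.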
Fix
- make_unified_diff now explicitly splits on \n instead of splitlines().
- Adds the newline back at the end of each line to preserve the original string.
- As before, adds a \n at the end of the whole string.
Then one LF-terminated row in HTML ⇒ one line for difflib ⇒ hunk counts match what python-patch reads.
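
A rough sketch of the splitting described above, not the actual patch. The names split_on_lf and make_unified_diff_sketch are hypothetical, and the real make_unified_diff signature may differ.

```python
import difflib

def split_on_lf(text):
    """Split on LF only and keep the LF on every line (hypothetical helper)."""
    if not text.endswith("\n"):
        text += "\n"  # the whole string always ends with a newline
    # Drop the empty piece after the final LF, then restore each line's LF.
    return [row + "\n" for row in text.split("\n")[:-1]]

def make_unified_diff_sketch(old, new):
    """Illustrative stand-in for make_unified_diff; file labels are arbitrary."""
    return "".join(
        difflib.unified_diff(split_on_lf(old), split_on_lf(new), "a", "b")
    )

diff = make_unified_diff_sketch(
    "alpha\u2028beta\ngamma\n",
    "alpha\u2028beta\nGAMMA\n",
)
print(diff.split("\n")[2])  # @@ -1,2 +1,2 @@
```

With this splitting, U+2028 stays inside its row, so the header's 2 lines per side equal the 2 LF-terminated rows per side that follow it.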
Tests
- New test test_unified_diff_round_trip_with_line_separator_u2028: text with U+2028 inside a row must round-trip through make_unified_diff + apply_unified_diff.
- Existing test_diffs cases still pass.
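
The property the new test protects can also be checked without python-patch. The sketch below is an assumption-laden stand-in, not the test in this change: hunk_counts_match_lf_rows and lf_lines are hypothetical helpers, and the checker only compares hunk header lengths against LF-delimited rows.

```python
import difflib
import re

HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")

def hunk_counts_match_lf_rows(diff_text):
    """True if every hunk header's lengths equal the LF-delimited rows after it."""
    rows = diff_text.split("\n")
    i = 0
    while i < len(rows):
        header = HUNK_RE.match(rows[i])
        i += 1
        if header is None:
            continue  # file headers or anything outside a hunk
        want_old = int(header.group(1) or 1)  # an omitted length means 1
        want_new = int(header.group(2) or 1)
        got_old = got_new = 0
        while i < len(rows) and rows[i] and not rows[i].startswith("@@"):
            tag = rows[i][0]
            got_old += tag in " -"  # context and removed rows count as old
            got_new += tag in " +"  # context and added rows count as new
            i += 1
        if (got_old, got_new) != (want_old, want_new):
            return False
    return True

def lf_lines(text):
    # Assumes text already ends with "\n".
    return [row + "\n" for row in text.split("\n")[:-1]]

old = "one\u2028two\nthree\n"
new = "one\u2028two\nTHREE\n"
broken = "".join(difflib.unified_diff(
    old.splitlines(keepends=True), new.splitlines(keepends=True), "a", "b"))
fixed = "".join(difflib.unified_diff(lf_lines(old), lf_lines(new), "a", "b"))

print(hunk_counts_match_lf_rows(broken))  # False
print(hunk_counts_match_lf_rows(fixed))   # True
```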
Bug: T419969