Replace sentencepiece special whitespace character "▁" with " "
Currently, text in non-whitespace languages is tokenized and the tokens are output as is. This leaves users to perform .replace("▁", " ") themselves.
We want to do this on our end, so that the resulting tokens are less ambiguous for users.
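A minimal sketch of the replacement described above, written against a plain list of token strings rather than the project's actual tokenizer code. The names `SP_SPACE` and `replace_sp_space` are hypothetical and used only for illustration:

```python
# "▁" is U+2581, the marker sentencepiece uses in place of a space.
SP_SPACE = "\u2581"


def replace_sp_space(tokens):
    """Return a copy of `tokens` with every "▁" replaced by a plain space."""
    return [token.replace(SP_SPACE, " ") for token in tokens]


# Toy tokens for illustration, not real sentencepiece output.
raw_tokens = ["\u2581hello", "\u2581wor", "ld"]
clean_tokens = replace_sp_space(raw_tokens)
```

With this in place, users can join the tokens directly instead of calling .replace("▁", " ") on each one.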
Caveat (not solved): sentencepiece always adds a "▁" before the sentence. This interferes with any existing whitespace at the beginning of the sentence. So we expect users to provide a whitespace-stripped sentence and to perform .lstrip() on the reconstructed sentence, which ensures that joining the tokens gives back the original sentence.
Sometimes the sentencepiece space is merged with the existing spaces, sometimes not; this varies by language. Example issues:
- 4 spaces: " វិគីភីឌាភាសាខ្មែរត..." --> ['▁▁▁▁', '▁វិគីភីឌា', 'ភាសាខ្មែរ', ... ]
- 4 spaces: " အပြည်ပြည်ဆိုင်ရာ ထ..." --> ['▁▁▁▁', '▁အ', 'ပြည်ပြည်ဆိုင်ရာ', ... ]
- 3 spaces: " វិគីភីឌាភាសាខ្មែរត..." --> ['▁▁▁▁', 'វិគីភីឌា', 'ភាសាខ្មែរ', ... ]
- 3 spaces: " အပြည်ပြည်ဆိုင်ရာ ထ..." --> ['▁▁▁', '▁အ', 'ပြည်ပြည်ဆိုင်ရာ', ... ]
- 2 spaces: " វិគីភីឌាភាសាខ្មែរត..." --> ['▁▁', '▁វិគីភីឌា', 'ភាសាខ្មែរ', ... ]
- 2 spaces: " အပြည်ပြည်ဆိုင်ရာ ထ..." --> ['▁▁', '▁အ', 'ပြည်ပြည်ဆိုင်ရာ', ... ]
- 1 space: " វិគីភីឌាភាសាខ្មែរត..." --> ['▁▁', 'វិគីភីឌា', 'ភាសាខ្មែរ', ... ]
- 1 spaces: " အပြည်ပြည်ဆိုင်ရာ ထ..." --> ['▁', '▁အ', 'ပြည်ပြည်ဆိုင်ရာ', ... ]
- 0 spaces: "វិគីភីឌាភាសាខ្មែរត..." --> ['▁វិគីភីឌា', 'ភាសាខ្មែរ', ... ]
- 0 spaces: "အပြည်ပြည်ဆိုင်ရာ ထ..." --> ['▁အ', 'ပြည်ပြည်ဆိုင်ရာ', ... ]
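
The examples above can be checked with a small sketch that works on the listed token lists directly and does not call sentencepiece. The helper names `leading_markers` and `reconstruct` are hypothetical, and the token lists are the truncated ones from the two 3-space examples:

```python
SP_SPACE = "\u2581"  # the sentencepiece whitespace marker "▁"


def leading_markers(tokens):
    """Count the "▁" characters at the start of the joined tokens."""
    joined = "".join(tokens)
    return len(joined) - len(joined.lstrip(SP_SPACE))


def reconstruct(tokens):
    """Join tokens, map "▁" to a space, and drop the leading spaces."""
    return "".join(tokens).replace(SP_SPACE, " ").lstrip(" ")


# Truncated token lists copied from the two "3 spaces" examples.
khmer_3 = ["▁▁▁▁", "វិគីភីឌា", "ភាសាខ្មែរ"]
burmese_3 = ["▁▁▁", "▁အ", "ပြည်ပြည်ဆိုင်ရာ"]

for name, tokens in [("khmer", khmer_3), ("burmese", burmese_3)]:
    print(name, leading_markers(tokens), repr(reconstruct(tokens)))
```

In the two lists used here the markers are split across tokens differently, but the joined string starts with the same number of "▁" characters, so stripping the input and calling .lstrip() on the reconstructed sentence removes them either way. This is only an observation about these examples, not a guarantee about sentencepiece's behavior in general.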