Sentence Tokenization: adapt sentence-splitting logic to account for right-to-left languages
Our current logic does not properly accommodate right-to-left languages.
MARTIN: there are two components to correctly handling right-to-left languages (non-whitespace languages are a separate discussion):
- Our logic for merging wrongly split sentences. For now, we have logic that works for left-to-right languages. We could file an issue and adapt the current logic for right-to-left languages (if that is needed; I am not sure that anything has to change).
- The identification of abbreviations from Wiktionary. I am not sure how abbreviated words appear in right-to-left languages: where is the punctuation symbol, and does our regex identify it correctly?
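
One way to probe both questions is to check where the punctuation sits in the stored string rather than where it appears on screen. Unicode text is stored in logical (reading) order, so a sentence-final mark follows the last letter of the sentence in the string even when it is rendered on the left. The sketch below is only an illustration under assumptions: `SENTENCE_END`, `is_rtl`, and `terminator_offsets` are hypothetical names, and the terminator set is a guess, not our actual regex.

```python
import re
import unicodedata

# Hypothetical terminator set (an assumption, not the project's real regex):
# ASCII marks plus the Arabic question mark (U+061F) and Arabic full stop (U+06D4).
SENTENCE_END = re.compile(r"[.!?\u061F\u06D4]")


def is_rtl(text):
    """Return True if the first strongly directional character is right-to-left."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew-type or Arabic-type letter
            return True
        if bidi == "L":  # left-to-right letter
            return False
    return False


def terminator_offsets(text):
    """Return the string offsets (logical order) of sentence terminators."""
    return [m.start() for m in SENTENCE_END.finditer(text)]


# Two short Hebrew sentences, written with escapes so that the editor's
# bidirectional rendering cannot obscure the stored order: "shalom. toda."
hebrew = "\u05e9\u05dc\u05d5\u05dd. \u05ea\u05d5\u05d3\u05d4."
english = "Hello. Thanks."

print(is_rtl(hebrew), terminator_offsets(hebrew))
print(is_rtl(english), terminator_offsets(english))
```

In both strings the terminator comes directly after the final letter of each sentence, which suggests that split and merge logic operating on string offsets may not need a direction-specific branch. What this does not answer is whether abbreviations in these languages are marked with a trailing period at all (Hebrew, for example, commonly uses geresh and gershayim instead), so the Wiktionary abbreviation regex still needs to be checked per language.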