Sentence Tokenization: adapt sentence-split-logic to take into account right-to-left languages

Our current logic does not properly accommodate right-to-left languages.

MARTIN: There are two components to handling right-to-left languages correctly (non-whitespace-delimited languages are a separate discussion):

Our logic to merge (wrongly) split sentences: for now, we have logic that works for left-to-right languages. We could file an issue and adapt the current logic for right-to-left languages if that turns out to be needed; I am not sure that anything needs to change.

The identification of abbreviations from Wiktionary: I am not sure how abbreviated words appear in right-to-left languages (where the punctuation symbol is placed, and whether our regex identifies it correctly).
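As a starting point for checking the punctuation question, here is a minimal Python sketch. It relies on one fact: Unicode strings are stored in logical (reading) order, so a period that ends a right-to-left word comes after its letters in the string even though it is displayed to their left. The pattern `ABBREV_CANDIDATE_RE` and both helper functions are hypothetical stand-ins, not our actual regex or code, and the Hebrew letters are arbitrary rather than a real abbreviation.

```python
import re
import unicodedata

# Hypothetical stand-in for the project's abbreviation regex:
# a run of word characters immediately followed by a period.
ABBREV_CANDIDATE_RE = re.compile(r"\w+\.")


def is_rtl(token):
    """Return True if any character has a right-to-left bidi class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in token)


def abbreviation_candidates(text):
    """Return every 'letters + period' match, in logical (storage) order."""
    return ABBREV_CANDIDATE_RE.findall(text)


# Escapes keep the logical order explicit: three Hebrew letters, a period,
# a space, then three more Hebrew letters (arbitrary letters, assumed input).
hebrew = "\u05d0\u05d1\u05d2. \u05d3\u05d4\u05d5"

candidates = abbreviation_candidates(hebrew)
period_index = hebrew.index(".")  # position right after the first three letters
```

If this holds for our real regex, a period-based pattern may need no change for right-to-left text. A separate case to check is Hebrew abbreviations written with geresh (U+05F3) or gershayim (U+05F4): neither is a word character or a period, so a pattern like the one above would not match them.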
