Sentence Tokenization: leading/trailing whitespace stripping
From the thread here:
Isaac: what's our expected behavior as far as leading/trailing whitespace: keep or strip out?
- On one hand: I think the whitespace currently gets stripped out, which matches the golden-rules format and makes sense for many applications.
- On the other hand: this makes it harder to, e.g., patch sentences back together. For example, if the rule-based tokenizer splits a sentence on an abbreviation and we then want to re-attach it in a second pass with the statistical tokenizer, we need to know what the separator was so we don't accidentally change the content.
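One way to get both behaviors is to return stripped sentences but also record the separator that followed each one, so the original text can be reconstructed losslessly. Below is a minimal sketch (not the project's actual tokenizer): a hypothetical regex splitter that treats `.`, `!`, `?` followed by whitespace as a boundary, keeps the in-between whitespace alongside each sentence, and a `rejoin` helper that proves reassembly is exact.

```python
import re

def split_with_separators(text):
    # Hypothetical minimal splitter: a real tokenizer handles
    # abbreviations etc.; the point is only to show how keeping
    # the separators makes reassembly lossless.
    sentences, separators = [], []
    start = 0
    for m in re.finditer(r'[.!?](\s+)', text):
        end = m.start() + 1            # include the closing punctuation
        sentences.append(text[start:end])
        separators.append(m.group(1))  # the whitespace that would be stripped
        start = m.end()
    if start < len(text):              # trailing sentence with no separator
        sentences.append(text[start:])
        separators.append('')
    return sentences, separators

def rejoin(sentences, separators):
    # Lossless reconstruction of the original input.
    return ''.join(s + sep for s, sep in zip(sentences, separators))
```

With this shape, the second pass can merge two sentences by concatenating them with the recorded separator instead of guessing a single space, so content like double spaces or newlines between sentences survives the round trip.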