Sentence Tokenization: leading/trailing whitespace stripping
From the thread here:
Isaac: what's our expected behavior as far as leading/trailing whitespace: keep or strip out?
- On one hand: I think the whitespace currently gets stripped out, which matches the golden-rules format and makes sense for many applications.
- On the other hand: this makes it harder to, e.g., patch sentences back together. For example, if the rule-based tokenizer splits a sentence on an abbreviation and we then want to re-attach it in a second pass with the statistical tokenizer, we need to know what the separator was so we don't accidentally change the content.
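One way to get both behaviors is to return stripped sentences but also record the separator that followed each one, so the original text can be reconstructed losslessly. Below is a minimal sketch (not the project's actual tokenizer): a hypothetical regex splitter that treats `.`, `!`, `?` followed by whitespace as a boundary, keeps the in-between whitespace alongside each sentence, and a `rejoin` helper that proves reassembly is exact.

```python
import re

def split_with_separators(text):
    # Hypothetical minimal splitter: a real tokenizer handles
    # abbreviations etc.; the point is only to show how keeping
    # the separators makes reassembly lossless.
    sentences, separators = [], []
    start = 0
    for m in re.finditer(r'[.!?](\s+)', text):
        end = m.start() + 1            # include the closing punctuation
        sentences.append(text[start:end])
        separators.append(m.group(1))  # the whitespace that would be stripped
        start = m.end()
    if start < len(text):              # trailing sentence with no separator
        sentences.append(text[start:])
        separators.append('')
    return sentences, separators

def rejoin(sentences, separators):
    # Lossless reconstruction of the original input.
    return ''.join(s + sep for s, sep in zip(sentences, separators))
```

With this shape, the second pass can merge two sentences by concatenating them with the recorded separator instead of guessing a single space, so content like double spaces or newlines between sentences survives the round trip.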