Word Tokenization: add abbreviation post-processing
We have implemented an iterative rule-based word-tokenization scheme for whitespace-delimited languages, which performs adequately at the moment. A token returned by this approach is one of: a sequence of consecutive whitespace characters, a sequence of consecutive punctuation characters, a single punctuation character not surrounded by word characters, or a sequence of word characters (which may include internal punctuation). For example, the text "I'm a sentence. \n" is tokenized into ["I'm", " ", "a", " ", "sentence", ".", " \n"].
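The rules above can be sketched with a single alternation regex. This is a hypothetical illustration of the described behavior, not our actual implementation (the pattern name and function are assumptions):

```python
import re

# Hypothetical sketch of the rule-based scheme: a token is a run of
# whitespace, a run of word characters that may contain internal
# punctuation (e.g. "I'm"), or a run of punctuation characters.
TOKEN_RE = re.compile(
    r"\s+"                  # run of whitespace
    r"|\w+(?:[^\w\s]\w+)*"  # word characters with internal punctuation
    r"|[^\w\s]+"            # run of punctuation
)

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(tokenize("I'm a sentence. \n"))
# ["I'm", " ", "a", " ", "sentence", ".", " \n"]
```

Note how the alternation order lets "I'm" match as a single word token before the punctuation rule can split the apostrophe out.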
However, a punctuation character at the end of a sentence can also be part of a valid token, e.g. an abbreviation. For example, our rule-based approach tokenizes "Dr." into ["Dr", "."].
To address this, we want to explore post-processing the tokens using our generated abbreviation files (Issue 10). We have already integrated this approach for sentence tokenization (Issue 5).
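A minimal sketch of what such a post-processing pass could look like, assuming the abbreviation files have been loaded into a set (the set contents, function name, and merge strategy here are assumptions for illustration):

```python
# Hypothetical abbreviation set; in practice this would be loaded from
# the generated abbreviation files (Issue 10).
ABBREVIATIONS = {"Dr.", "Mr.", "e.g."}

def merge_abbreviations(tokens: list[str], abbreviations=ABBREVIATIONS) -> list[str]:
    """Merge a trailing "." back into the preceding token when the
    combined form appears in the abbreviation set."""
    merged: list[str] = []
    for tok in tokens:
        if tok == "." and merged and merged[-1] + "." in abbreviations:
            merged[-1] += "."
        else:
            merged.append(tok)
    return merged

print(merge_abbreviations(["Dr", ".", " ", "Smith"]))
# ["Dr.", " ", "Smith"]
```

Running this after tokenization would repair the ["Dr", "."] split while leaving genuine sentence-final periods untouched.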