Word Tokenization: Whitespace-delimited languages
Goal: Implement a word tokenizer for whitespace-delimited languages. We will start with a simple, lightweight tokenizer based mostly on regexes, i.e. splitting on whitespace and similar boundaries. We will refine the tokenizer with more complex approaches in later iterations.
Approach:
- the output should be of the same type as sentence segmentation (i.e. yield the tokens)
- should be language-specific
- should work for all languages that are not contained in our list of non-whitespace-delimited languages (see the sketch after this list)
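
A minimal sketch of what this first iteration could look like, assuming Python. The names `NON_WHITESPACE_DELIMITED`, `_TOKEN_RE`, and `tokenize` are illustrative only and not taken from the codebase; the actual language list and interfaces may differ.

```python
import re
from typing import Iterator

# Hypothetical set of language codes whose scripts do not delimit words
# with whitespace; these would be handled by dedicated tokenizers later.
NON_WHITESPACE_DELIMITED = {"zh", "ja", "th", "km", "lo", "my"}

# First iteration: a token is simply a maximal run of non-whitespace characters.
_TOKEN_RE = re.compile(r"\S+")


def tokenize(text: str, lang: str) -> Iterator[str]:
    """Yield word tokens for a whitespace-delimited language."""
    if lang in NON_WHITESPACE_DELIMITED:
        raise ValueError(
            f"{lang!r} is not whitespace-delimited; use a dedicated tokenizer"
        )
    for match in _TOKEN_RE.finditer(text):
        yield match.group()
```

Example usage: `list(tokenize("Hello, world!", "en"))` returns `['Hello,', 'world!']`. Punctuation stays attached to tokens in this version; stripping or splitting it off would be one of the refinements in later iterations.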
References
- Background: Why do we need separate tokenizers for each language?
- Starter tools: Word Tokenization bg study