Word Tokenization Starter Study
Statistical sentence tokenizers may benefit from word tokenization as a supporting step. Although word tokenization is the more complicated problem, there are existing codebases we can build on. Given the time constraints, we are switching focus to word tokenization without fully wrapping up sentence tokenization.
Potential Steps:
- Read one recent survey paper
- Review existing tools
  - Rule-based (whitespace-delimited and non-whitespace languages)
  - Unsupervised (language-agnostic)
- Collect datasets
- Implement baseline tokenizer
- Implement benchmarking method
- Iteratively update the tokenizer
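For the baseline tokenizer step, a minimal sketch of a rule-based approach for whitespace-delimited languages: split on whitespace, then separate punctuation from word cores. The function name and regex are illustrative assumptions, not part of the plan above; non-whitespace languages (e.g. Chinese) would need a different method.

```python
import re

def baseline_tokenize(text: str) -> list[str]:
    # Assumed baseline: split on whitespace, then separate word
    # characters from punctuation within each chunk. Only suitable
    # for whitespace-delimited languages.
    tokens = []
    for chunk in text.split():
        # \w+ matches word cores; [^\w\s] matches single punctuation marks.
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(baseline_tokenize("Hello, world!"))
# → ['Hello', ',', 'world', '!']
```

This deliberately mishandles cases like contractions ("It's" becomes `['It', "'", 's']`), which is acceptable for a baseline that later iterations improve on.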