Word Tokenization: evaluation methodology
We currently implement different word-tokenization schemes for whitespace-delimited and non-whitespace-delimited languages.
- Whitespace-delimited languages -> uses a rule-based word tokenizer
- Non-whitespace-delimited languages -> uses a SentencePiece model for subword tokenization
Because the two schemes produce tokens at different granularities (words vs. subwords), we may need separate evaluation methods and datasets for each.
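A minimal sketch of the dispatch between the two schemes. The names (`tokenize`, `WHITESPACE_DELIMITED`) are hypothetical, and the character-level branch is only a stand-in for illustration; in practice that branch would load a trained SentencePiece model.

```python
import re

# Hypothetical set of whitespace-delimited language codes.
WHITESPACE_DELIMITED = {"en", "de", "fr", "es"}

def tokenize(text: str, lang: str) -> list[str]:
    """Dispatch to the tokenizer appropriate for the language type."""
    if lang in WHITESPACE_DELIMITED:
        # Rule-based word tokenizer: split on word characters,
        # keeping punctuation as separate tokens.
        return re.findall(r"\w+|[^\w\s]", text)
    # Stand-in for subword tokenization: character-level split.
    # A real implementation would use a trained SentencePiece model, e.g.:
    #   sp = sentencepiece.SentencePieceProcessor(model_file="...")
    #   return sp.encode(text, out_type=str)
    return [ch for ch in text if not ch.isspace()]

print(tokenize("Hello, world!", "en"))  # word-level tokens
print(tokenize("你好", "zh"))           # character-level stand-in
```

Because the word-level branch yields whole words and the subword branch yields finer-grained units, metrics such as tokens-per-sentence are not directly comparable across the two, which motivates separate evaluation setups.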
Suggestions: