Resolve "train sentencepiece for sample language clusters"
Goals:
- Gather a corpus to train sentencepiece for non-whitespace (NWS) languages
- Add trainer code
- Have a preliminary sentencepiece model for prototyping
- Incorporate methods that use trained sentencepiece models for tokenization
- Add tests for checking the NWS language tokenization method
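
For reviewers, a minimal sketch of the train-then-tokenize flow these goals describe. The function names (`train_nws_model`, `tokenize_nws`, `clean_pieces`), the `unigram` model type, and the vocabulary size and character coverage values are illustrative assumptions, not necessarily what this MR's trainer code uses. Only the `sentencepiece` calls themselves are real library API.

```python
from pathlib import Path

# SentencePiece marks word boundaries with U+2581 ("lower one eighth block").
SP_SPACE = "\u2581"


def train_nws_model(corpus_path, model_prefix, vocab_size=8000):
    """Train a sentencepiece model on a one-sentence-per-line corpus.

    Hypothetical helper: the hyperparameters are placeholders for prototyping.
    """
    import sentencepiece as spm  # third-party: pip install sentencepiece

    spm.SentencePieceTrainer.train(
        input=str(corpus_path),
        model_prefix=str(model_prefix),
        vocab_size=vocab_size,
        model_type="unigram",
        character_coverage=0.9995,  # assumed value for large character sets
    )
    # The trainer writes <model_prefix>.model and <model_prefix>.vocab.
    return Path(f"{model_prefix}.model")


def clean_pieces(pieces):
    """Strip the boundary marker and drop pieces that become empty."""
    stripped = (piece.replace(SP_SPACE, "") for piece in pieces)
    return [piece for piece in stripped if piece]


def tokenize_nws(text, model_path):
    """Tokenize text with a trained model and return plain string tokens."""
    import sentencepiece as spm

    processor = spm.SentencePieceProcessor(model_file=str(model_path))
    return clean_pieces(processor.encode(text, out_type=str))
```

A test for the NWS tokenization method could then train a small model on a fixture corpus, call `tokenize_nws` on a known sentence, and assert that no returned token is empty or still contains the boundary marker.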
Closes #18
Edited by Appledora