Resolve "train sentencepiece for sample language clusters"

Goals:

  • Gather a corpus to train sentencepiece for non-whitespace (NWS) languages
  • Add trainer code
  • Provide a preliminary sentencepiece model for prototyping
  • Incorporate methods that use trained sentencepiece models for tokenization
  • Add tests for the NWS language tokenization method
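
For reviewers unfamiliar with the workflow, the training and tokenization goals above could look roughly like the sketch below. It is illustrative only and does not reproduce the code in this merge request: the helper names (build_trainer_args, train_model, tokenize), the corpus path, the model prefix, and the values for vocab_size, model_type, and character_coverage are all assumptions.

```python
from pathlib import Path


def build_trainer_args(corpus_path, model_prefix, vocab_size=8000):
    """Collect keyword arguments for SentencePieceTrainer.train.

    All values here are illustrative assumptions, not this MR's settings.
    """
    return {
        "input": str(corpus_path),       # one sentence per line, raw text
        "model_prefix": model_prefix,    # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": "unigram",
        "character_coverage": 0.9995,    # assumed; often raised for large character sets
    }


def train_model(corpus_path, model_prefix, vocab_size=8000):
    """Train a sentencepiece model and return the path to the .model file."""
    import sentencepiece as spm  # third-party; imported lazily

    spm.SentencePieceTrainer.train(
        **build_trainer_args(corpus_path, model_prefix, vocab_size)
    )
    return Path(f"{model_prefix}.model")


def tokenize(text, model_path):
    """Split text into subword pieces using a trained sentencepiece model."""
    import sentencepiece as spm  # third-party; imported lazily

    processor = spm.SentencePieceProcessor(model_file=str(model_path))
    return processor.encode(text, out_type=str)
```

A test for the NWS tokenization method could then train on a small fixture corpus and assert that tokenize returns a non-empty list of strings for a sample sentence.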

Closes #18 (closed)

Edited by Appledora
