Resolve "train sentencepiece for sample language clusters"

Appledora requested to merge 18-train-sentencepiece-for-sample-languages into main

Goals:

  • Gather a corpus to train sentencepiece for non-whitespace (NWS) languages
  • Add the training code
  • Provide a preliminary sentencepiece model for prototyping
  • Incorporate methods that utilize trained sentencepiece models for tokenization
  • Add tests for the NWS language tokenization method
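
For reviewers, here is a minimal sketch of how the training and tokenization pieces could fit together. It is not the code in this merge request. The function names (`train_spm`, `load_spm_encoder`, `tokenize`), the `NWS_LANGS` set, and the training parameters are all hypothetical and chosen only for illustration. The only real library calls are `sentencepiece.SentencePieceTrainer.train` and `sentencepiece.SentencePieceProcessor`.

```python
from typing import Callable, List, Optional

# Hypothetical set of language codes treated as non-whitespace (NWS).
# The actual language clusters are defined by the merge request, not here.
NWS_LANGS = {"zh", "ja", "th", "km", "lo", "my"}


def train_spm(corpus_path: str, model_prefix: str, vocab_size: int = 8000) -> None:
    """Train a sentencepiece model on a plain-text corpus (one sentence per line).

    Writes <model_prefix>.model and <model_prefix>.vocab. Parameter values
    are illustrative assumptions, not the settings used in this MR.
    """
    import sentencepiece as spm  # imported lazily; only needed for training

    spm.SentencePieceTrainer.train(
        input=corpus_path,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type="unigram",
        character_coverage=0.9995,
    )


def load_spm_encoder(model_file: str) -> Callable[[str], List[str]]:
    """Return a function that splits text into sentencepiece tokens."""
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=model_file)
    return lambda text: sp.encode(text, out_type=str)


def tokenize(
    text: str,
    lang: str,
    nws_encoder: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Use the NWS encoder for NWS languages, whitespace splitting otherwise."""
    if lang in NWS_LANGS:
        if nws_encoder is None:
            raise ValueError(f"no sentencepiece encoder supplied for NWS language {lang!r}")
        return nws_encoder(text)
    return text.split()
```

Passing the encoder in as a callable keeps `tokenize` testable without a trained model: a test can supply a stub encoder, while production code would pass the result of `load_spm_encoder`.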

Closes #18 (closed)

