Split punctuation in non-WS languages.

Currently we return the tokens generated by SentencePiece with little to no modification. As a result, punctuation, especially sentence terminators, sometimes stays attached to the adjacent word.

Martin:

We seem to be doing something differently in the word-tokenization for non-ws languages.
Input sentence: အပြည်ပြည်ဆိုင်ရာ ထေရဝါဒဗုဒ္ဓသာသနာပြုတက္ကသိုလ် တွင်လည်း ဒုတိယပါမောက္ခချုပ်အဖြစ် တာဝန်ထမ်းဆောင်ခဲ့သည်။
word-tokenization with language_code=my:  ['▁အ', 'ပြည်ပြည်ဆိုင်ရာ', '▁ထေရဝါဒ', 'ဗုဒ္ဓ', 'သာသနာပြု', 'တက္ကသိုလ်', '▁တွင်လည်း', '▁ဒုတိယ', 'ပါမောက္ခ', 'ချုပ်အဖြစ်', '▁တာဝန်ထမ်းဆောင်', 'ခဲ့သည်။']
word-tokenization with language_code=en:  ['အပြည်ပြည်ဆိုင်ရာ', ' ', 'ထေရဝါဒဗုဒ္ဓသာသနာပြုတက္ကသိုလ်', ' ', 'တွင်လည်း', ' ', 'ဒုတိယပါမောက္ခချုပ်အဖြစ်', ' ', 'တာဝန်ထမ်းဆောင်ခဲ့သည်', '။']

Ideally, the '။' should be split off as its own token, as it is in the language_code=en output.
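One possible approach, sketched here as a post-processing step over the SentencePiece output: walk each token and emit every Unicode punctuation character as its own token. The function name `split_punctuation` is hypothetical and not part of the existing code, and treating every character in a Unicode "P" category as punctuation is an assumption that may be too aggressive for some languages. The '▁' word-boundary marker is a symbol rather than punctuation, so it stays attached.

```python
import unicodedata


def is_punct(ch: str) -> bool:
    # Unicode general categories starting with "P" cover punctuation,
    # including the Burmese section sign '။' (U+104B).
    return unicodedata.category(ch).startswith("P")


def split_punctuation(tokens):
    """Hypothetical helper: split punctuation characters out of each token."""
    out = []
    for token in tokens:
        buf = ""
        for ch in token:
            if is_punct(ch):
                if buf:
                    out.append(buf)  # flush the word part collected so far
                    buf = ""
                out.append(ch)  # punctuation becomes a standalone token
            else:
                buf += ch
        if buf:
            out.append(buf)
    return out


# Last two tokens from the language_code=my output above.
print(split_punctuation(['▁တာဝန်ထမ်းဆောင်', 'ခဲ့သည်။']))
```

With this sketch the final token should come back as the word followed by a separate '။', while tokens without punctuation pass through unchanged.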
