Word tokenization: train sentencepiece for sample language clusters
CONTEXT:
Martin: I don't think we need to use sentencepiece for whitespace-delimited languages. One: we think that our standard tokenizer should work fairly well for them. Second, from what I understood, sentencepiece leads to subword tokenization, which we don't want for whitespace-delimited languages where tokens are clearly defined. Before scaling to all languages with different models, I would run sentencepiece on a manually selected cluster of a few languages for which we think the grouping makes sense (or even a single language with a large wiki, such as Japanese), then plug that model into the tokenization and try to understand whether it works or not.
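A minimal sketch of that single-language experiment, assuming a plain-text corpus of Japanese wiki sentences, one per line; the file name, vocab size, and model prefix are illustrative, not decided:

```python
import sentencepiece as spm

# Train a unigram sentencepiece model on the sample corpus.
# A character_coverage close to 1.0 is the usual recommendation for
# languages with large character sets such as Japanese.
spm.SentencePieceTrainer.train(
    input="ja_wiki_sentences.txt",  # assumed corpus file, one sentence per line
    model_prefix="ja_wiki",         # writes ja_wiki.model and ja_wiki.vocab
    model_type="unigram",
    vocab_size=32000,               # illustrative; to be tuned
    character_coverage=0.9995,
)

# Load the trained model and tokenize a sample sentence into string pieces.
sp = spm.SentencePieceProcessor(model_file="ja_wiki.model")
print(sp.encode("吾輩は猫である。", out_type=str))
```

Inspecting the pieces the model produces on held-out sentences should tell us whether the segmentation is usable before scaling to clusters.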
Goals:
- have a preliminary list of non-whitespace languages
- have a preliminary list of language family clusters (from wiki fallback languages)
- collect a corpus to train sentencepiece (10M sentences reportedly suffices)
- build the training pipeline (see the training sketch above)
- run small-scale experiments on sparql
- add a tokenization method for non-whitespace languages (see the sketch after this list)
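For the last goal, a hedged sketch of how the tokenization method could dispatch between the standard whitespace tokenizer and a per-cluster sentencepiece model; NON_WHITESPACE_LANGS, CLUSTER_MODELS, and tokenize are hypothetical placeholders, since the real language list and cluster-to-model mapping are themselves deliverables of this task:

```python
import sentencepiece as spm

# Preliminary, illustrative set of non-whitespace language codes (assumption).
NON_WHITESPACE_LANGS = {"ja", "zh", "th", "km", "lo", "my"}

# Illustrative mapping from language code to a trained cluster model file.
CLUSTER_MODELS = {"ja": "ja_wiki.model"}

# Cache of loaded processors so each model file is read only once.
_cache: dict[str, spm.SentencePieceProcessor] = {}

def tokenize(text: str, lang: str) -> list[str]:
    """Split on whitespace for delimited languages; otherwise use the
    sentencepiece model trained for the language's cluster."""
    if lang not in NON_WHITESPACE_LANGS:
        return text.split()
    model_file = CLUSTER_MODELS[lang]
    if model_file not in _cache:
        _cache[model_file] = spm.SentencePieceProcessor(model_file=model_file)
    return _cache[model_file].encode(text, out_type=str)
```

Usage: tokenize("吾輩は猫である。", "ja") goes through the sentencepiece model, while tokenize("the quick fox", "en") falls back to the standard whitespace split.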