Word tokenization: train sentencepiece for sample language clusters
CONTEXT:
Martin: I don't think we need to use sentencepiece for whitespace-delimited languages. One: we think that our standard tokenizer should work fairly well for them. Second, from what I understood, sentencepiece leads to subword tokenization, which we don't want for whitespace-delimited languages where tokens are clearly defined. Before scaling to all languages with different models, I would run sentencepiece on a manually selected cluster of a few languages for which we think the grouping makes sense (or even a single language with a large wiki, such as Japanese), then plug that model into the tokenization and try to understand whether it works or not.
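A minimal sketch of that single-language experiment, assuming a plain-text corpus of Japanese wiki sentences, one per line; the file name, vocab size, and model prefix are illustrative, not decided:

```python
import sentencepiece as spm

# Train a unigram sentencepiece model on the sample corpus.
# A character_coverage close to 1.0 is the usual recommendation for
# languages with large character sets such as Japanese.
spm.SentencePieceTrainer.train(
    input="ja_wiki_sentences.txt",  # assumed corpus file, one sentence per line
    model_prefix="ja_wiki",         # writes ja_wiki.model and ja_wiki.vocab
    model_type="unigram",
    vocab_size=32000,               # illustrative; to be tuned
    character_coverage=0.9995,
)

# Load the trained model and tokenize a sample sentence into string pieces.
sp = spm.SentencePieceProcessor(model_file="ja_wiki.model")
print(sp.encode("吾輩は猫である。", out_type=str))
```

Inspecting the pieces the model produces on held-out sentences should tell us whether the segmentation is usable before scaling to clusters.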
Goals:
- have a preliminary list of non-whitespace languages
- have a preliminary list of language family clusters (from wiki fallback languages)
- collect a corpus to train sentencepiece (10M sentences reportedly suffices)
- build the training pipeline (see the training sketch above)
- run small-scale experiments on sparql
- add a tokenization method for non-whitespace languages (see the sketch after this list)
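For the last goal, a hedged sketch of how the tokenization method could dispatch between the standard whitespace tokenizer and a per-cluster sentencepiece model; NON_WHITESPACE_LANGS, CLUSTER_MODELS, and tokenize are hypothetical placeholders, since the real language list and cluster-to-model mapping are themselves deliverables of this task:

```python
import sentencepiece as spm

# Preliminary, illustrative set of non-whitespace language codes (assumption).
NON_WHITESPACE_LANGS = {"ja", "zh", "th", "km", "lo", "my"}

# Illustrative mapping from language code to a trained cluster model file.
CLUSTER_MODELS = {"ja": "ja_wiki.model"}

# Cache of loaded processors so each model file is read only once.
_cache: dict[str, spm.SentencePieceProcessor] = {}

def tokenize(text: str, lang: str) -> list[str]:
    """Split on whitespace for delimited languages; otherwise use the
    sentencepiece model trained for the language's cluster."""
    if lang not in NON_WHITESPACE_LANGS:
        return text.split()
    model_file = CLUSTER_MODELS[lang]
    if model_file not in _cache:
        _cache[model_file] = spm.SentencePieceProcessor(model_file=model_file)
    return _cache[model_file].encode(text, out_type=str)
```

Usage: tokenize("吾輩は猫である。", "ja") goes through the sentencepiece model, while tokenize("the quick fox", "en") falls back to the standard whitespace split.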