Word Tokenization: evaluation methodology
We currently implement different word-tokenization schemes for whitespace-delimited and non-whitespace-delimited languages.
- Whitespace-delimited languages -> uses a rule-based word tokenizer
- Non-whitespace-delimited languages -> uses a SentencePiece model for subword tokenization
Because the two schemes produce tokens at different granularities (words vs. subwords), we may need separate evaluation methods and datasets for each.
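A minimal sketch of the dispatch between the two schemes. The names (`tokenize`, `WHITESPACE_DELIMITED`) are hypothetical, and the character-level branch is only a stand-in for illustration; in practice that branch would load a trained SentencePiece model.

```python
import re

# Hypothetical set of whitespace-delimited language codes.
WHITESPACE_DELIMITED = {"en", "de", "fr", "es"}

def tokenize(text: str, lang: str) -> list[str]:
    """Dispatch to the tokenizer appropriate for the language type."""
    if lang in WHITESPACE_DELIMITED:
        # Rule-based word tokenizer: split on word characters,
        # keeping punctuation as separate tokens.
        return re.findall(r"\w+|[^\w\s]", text)
    # Stand-in for subword tokenization: character-level split.
    # A real implementation would use a trained SentencePiece model, e.g.:
    #   sp = sentencepiece.SentencePieceProcessor(model_file="...")
    #   return sp.encode(text, out_type=str)
    return [ch for ch in text if not ch.isspace()]

print(tokenize("Hello, world!", "en"))  # word-level tokens
print(tokenize("你好", "zh"))           # character-level stand-in
```

Because the word-level branch yields whole words and the subword branch yields finer-grained units, metrics such as tokens-per-sentence are not directly comparable across the two, which motivates separate evaluation setups.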
Suggestions: