Word Tokenization Starter Study
Statistical sentence tokenizers may benefit from word tokenization as a supporting step. Although word tokenization is the more complicated problem, there are existing codebases we can build on. Given the time constraints, we are switching focus to word tokenization without fully wrapping up sentence tokenization.
Potential Steps:
- Read one recent survey paper
- Review existing tools
  - Rule-based (whitespace-delimited and non-whitespace languages)
  - Unsupervised (language-agnostic)
- Collect datasets
- Implement baseline tokenizer
- Implement benchmarking method
- Iteratively update the tokenizer
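For the baseline tokenizer step, a minimal sketch of a rule-based approach for whitespace-delimited languages: split on whitespace, then separate punctuation from word cores. The function name and regex are illustrative assumptions, not part of the plan above; non-whitespace languages (e.g. Chinese) would need a different method.

```python
import re

def baseline_tokenize(text: str) -> list[str]:
    # Assumed baseline: split on whitespace, then separate word
    # characters from punctuation within each chunk. Only suitable
    # for whitespace-delimited languages.
    tokens = []
    for chunk in text.split():
        # \w+ matches word cores; [^\w\s] matches single punctuation marks.
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(baseline_tokenize("Hello, world!"))
# → ['Hello', ',', 'world', '!']
```

This deliberately mishandles cases like contractions ("It's" becomes `['It', "'", 's']`), which is acceptable for a baseline that later iterations improve on.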