Word Tokenization: Whitespace-delimited languages
Goal: Implement a word tokenizer for whitespace-delimited languages. We will start with a simple, lightweight tokenizer based mostly on regexes, i.e. splitting on whitespace and similar boundaries. We will refine the tokenizer with more complex approaches in later iterations.
Approach:
- the output should be of the same type as sentence segmentation (i.e. yield the tokens)
- should be language-specific
- should work for all languages that are not contained in our list of non-whitespace-delimited languages (see the sketch after this list)
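
A minimal sketch of what this first iteration could look like, assuming Python. The names `NON_WHITESPACE_DELIMITED`, `_TOKEN_RE`, and `tokenize` are illustrative only and not taken from the codebase; the actual language list and interfaces may differ.

```python
import re
from typing import Iterator

# Hypothetical set of language codes whose scripts do not delimit words
# with whitespace; these would be handled by dedicated tokenizers later.
NON_WHITESPACE_DELIMITED = {"zh", "ja", "th", "km", "lo", "my"}

# First iteration: a token is simply a maximal run of non-whitespace characters.
_TOKEN_RE = re.compile(r"\S+")


def tokenize(text: str, lang: str) -> Iterator[str]:
    """Yield word tokens for a whitespace-delimited language."""
    if lang in NON_WHITESPACE_DELIMITED:
        raise ValueError(
            f"{lang!r} is not whitespace-delimited; use a dedicated tokenizer"
        )
    for match in _TOKEN_RE.finditer(text):
        yield match.group()
```

Example usage: `list(tokenize("Hello, world!", "en"))` returns `['Hello,', 'world!']`. Punctuation stays attached to tokens in this version; stripping or splitting it off would be one of the refinements in later iterations.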
References
- Background: Why do we need separate tokenizers for each language?
- Starter tools: Word Tokenization bg study