
Resolve "Sentence Tokenizer: add FLORES dataset for benchmarking"

Appledora requested to merge 31-flores into main
  • Picked the sentences corresponding to the ENGLISH dataset from every language file in the FLORES corpus.
  • Compiled the 888 sentences per language into a benchmarking JSON file.
  • Calculated the benchmarking metrics as previously discussed.
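The compilation step could look roughly like this. A minimal sketch, assuming the FLORES files are plain text with one sentence per line and named by their 3-letter language code (e.g. `eng.devtest`); the directory name, file suffix, and `compile_benchmark` helper are illustrative, not the actual implementation.

```python
import json
from pathlib import Path

def compile_benchmark(corpus_dir: Path, out_path: Path) -> None:
    """Collect one sentence list per language file into a single JSON file."""
    benchmark = {}
    for lang_file in sorted(corpus_dir.glob("*.devtest")):
        lang_code = lang_file.stem  # e.g. "eng" (FLORES uses 3-letter codes)
        sentences = [
            line.strip()
            for line in lang_file.read_text(encoding="utf-8").splitlines()
            if line.strip()
        ]
        benchmark[lang_code] = sentences
    # ensure_ascii=False keeps non-Latin scripts readable in the output file
    out_path.write_text(
        json.dumps(benchmark, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

The resulting JSON maps each language code to its list of sentences, so the benchmark can iterate languages without re-reading the corpus.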

Issues:

- For some of the languages, I couldn't find a 2-letter code in the ISO format used by Wikipedia and our project.
- Of the `204` language files in the corpus, a 2-letter code couldn't be identified for `87`.
- For those, I used the 3-letter code in the compiled JSON.
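The fallback described above could be sketched as a simple lookup: map the 3-letter ISO 639-3 code to a 2-letter ISO 639-1 code when one exists, otherwise keep the 3-letter code. The mapping table here is a tiny illustrative sample, not the full list, and `wiki_code` is a hypothetical helper name.

```python
# Partial ISO 639-3 -> ISO 639-1 mapping (sample entries for illustration only)
ISO3_TO_ISO1 = {
    "eng": "en",
    "fra": "fr",
    "ben": "bn",
}

def wiki_code(iso3: str) -> str:
    """Return the 2-letter code where one exists; otherwise
    fall back to the 3-letter ISO 639-3 code unchanged."""
    return ISO3_TO_ISO1.get(iso3, iso3)
```

With this fallback, every language file still gets a usable key in the compiled JSON, even when no 2-letter code exists.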

Closes #31
