
Resolve "Sentence Tokenizer: add FLORES dataset for benchmarking"

Appledora requested to merge 31-flores into main
  • Picked the sentences corresponding to the ENGLISH dataset from every language file in the FLORES corpus.
  • Compiled the 888 sentences per language into a benchmarking JSON file.
  • Calculated the benchmarking metrics as previously discussed.
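The compilation step could look roughly like this. A minimal sketch, assuming the FLORES files are plain text with one sentence per line and named by their 3-letter language code (e.g. `eng.devtest`); the directory name, file suffix, and `compile_benchmark` helper are illustrative, not the actual implementation.

```python
import json
from pathlib import Path

def compile_benchmark(corpus_dir: Path, out_path: Path) -> None:
    """Collect one sentence list per language file into a single JSON file."""
    benchmark = {}
    for lang_file in sorted(corpus_dir.glob("*.devtest")):
        lang_code = lang_file.stem  # e.g. "eng" (FLORES uses 3-letter codes)
        sentences = [
            line.strip()
            for line in lang_file.read_text(encoding="utf-8").splitlines()
            if line.strip()
        ]
        benchmark[lang_code] = sentences
    # ensure_ascii=False keeps non-Latin scripts readable in the output file
    out_path.write_text(
        json.dumps(benchmark, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

The resulting JSON maps each language code to its list of sentences, so the benchmark can iterate languages without re-reading the corpus.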

Issues:

- For some of the languages, I couldn't find a 2-letter code in the ISO format used by Wikipedia and our project.
- Of the `204` language files in the corpus, a 2-letter code couldn't be identified for `87`.
- For those, I used the 3-letter code in the compiled JSON.
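The fallback described above could be sketched as a simple lookup: map the 3-letter ISO 639-3 code to a 2-letter ISO 639-1 code when one exists, otherwise keep the 3-letter code. The mapping table here is a tiny illustrative sample, not the full list, and `wiki_code` is a hypothetical helper name.

```python
# Partial ISO 639-3 -> ISO 639-1 mapping (sample entries for illustration only)
ISO3_TO_ISO1 = {
    "eng": "en",
    "fra": "fr",
    "ben": "bn",
}

def wiki_code(iso3: str) -> str:
    """Return the 2-letter code where one exists; otherwise
    fall back to the 3-letter ISO 639-3 code unchanged."""
    return ISO3_TO_ISO1.get(iso3, iso3)
```

With this fallback, every language file still gets a usable key in the compiled JSON, even when no 2-letter code exists.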

Closes #31
