Bump mwtokenizer version

AKhatun requested to merge bump-version into main

Bump from 0.1.0 to 0.2.0.

Changes:

  • For non-whitespace languages, sentencepiece outputs "▁" in place of " ". The tokenizer now replaces "▁" with " " itself, so end users don't have to.
  • For non-whitespace languages, spaces and punctuation marks are now emitted as separate tokens, as is already done for whitespace languages.
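
As a rough illustration of the two changes above, here is a minimal sketch of the post-processing. It is not the actual mwtokenizer code: the function names `restore_spaces` and `split_spaces_and_punctuation` are hypothetical, the input is assumed to be a plain list of sentencepiece-style pieces, and the regex treats every character that is neither a word character nor whitespace as punctuation, which is a simplification.

```python
import re

# "▁" (U+2581) is the marker sentencepiece emits in place of a space.
SP_SPACE = "\u2581"

# Matches a single whitespace character or a single non-word, non-space
# character. The capture group makes re.split keep the separators.
_SPLIT_RE = re.compile(r"(\s|[^\w\s])")


def restore_spaces(pieces):
    """Replace the sentencepiece marker with a plain space in each piece.

    Hypothetical helper, for illustration only.
    """
    return [piece.replace(SP_SPACE, " ") for piece in pieces]


def split_spaces_and_punctuation(pieces):
    """Emit each space and punctuation mark as its own token.

    Hypothetical helper, for illustration only.
    """
    tokens = []
    for piece in pieces:
        # re.split yields empty strings around adjacent separators; drop them.
        tokens.extend(part for part in _SPLIT_RE.split(piece) if part)
    return tokens


# Assumed example input: pieces as a sentencepiece model might return them.
pieces = ["\u2581こんにちは", "、", "世界", "\u2581!"]
tokens = split_spaces_and_punctuation(restore_spaces(pieces))
print(tokens)
```

Joining the resulting tokens gives back the input text with ordinary spaces, so callers no longer need to handle "▁" themselves.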
