Incorporate mwtokenizer to handle non-WS languages

AKhatun requested to merge add-mwtokenizer into main

This MR makes a lot of changes, and not all components could be tested individually. The anchor dict and training sentences were scanned for anything unusual, and nothing stood out.

Some results (precision and recall at 0.5 threshold):

| wiki | precision | recall |
|---|---|---|
| bnwiki | 0.7318187141351447 | 0.2997409823484267 |
| bowiki | 0.9405204460966543 | 0.08582089552238806 |
| mywiki | 0.22297297297297297 | 0.002977533158891997 |
| simplewiki | 0.7985790945506764 | 0.4305623471882641 |

Here are the previous results: https://meta.wikimedia.org/wiki/Research:Improving_multilingual_support_for_link_recommendation_model_for_add-a-link_task/Results_round-1

Results for bnwiki and simplewiki changed little, which is good since they are whitespace-delimited (WS) languages. For the non-WS languages, mywiki precision dropped a lot and bowiki precision increased a lot. Neither has good enough recall, though: bowiki reaches about 8.6% and mywiki about 0.3%, while we need at least 20%.
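
For readers unfamiliar with the metric, here is a minimal sketch of how precision and recall at a 0.5 threshold can be computed and checked against the 20% recall bar. It is a generic illustration, not the evaluation code in this repository: the function name, the `MIN_RECALL` constant, the toy scores and labels, and the choice of `>=` for the threshold comparison are all assumptions.

```python
# Hypothetical helper: not taken from the repository's evaluation script.
def precision_recall_at_threshold(scores, labels, threshold=0.5):
    """Return (precision, recall) when a link is predicted if score >= threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold  # assumption: threshold is inclusive
        if predicted and label:
            tp += 1  # predicted link that is a true link
        elif predicted and not label:
            fp += 1  # predicted link that is not a true link
        elif not predicted and label:
            fn += 1  # true link the model missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


MIN_RECALL = 0.20  # the "at least 20%" requirement from the description

# Toy data for illustration only: model probabilities and ground-truth labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0]

precision, recall = precision_recall_at_threshold(scores, labels)
meets_recall_bar = recall >= MIN_RECALL
print(f"precision={precision:.3f} recall={recall:.3f} ok={meets_recall_bar}")
```

A high threshold tends to trade recall for precision, which matches the bowiki pattern above: the few links the model is confident about are mostly right, but most true links are missed.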