Incorporate mwtokenizer to handle non-whitespace-delimited (non-WS) languages
This change touches many components, and not all of them could be tested individually. As a sanity check, the anchor dict and training sentences were scanned for anything unusual, and nothing stood out.
Some results (precision and recall at 0.5 threshold):
| wiki | precision | recall |
|---|---|---|
| bnwiki | 0.7318187141351447 | 0.2997409823484267 |
| bowiki | 0.9405204460966543 | 0.08582089552238806 |
| mywiki | 0.22297297297297297 | 0.002977533158891997 |
| simplewiki | 0.7985790945506764 | 0.4305623471882641 |
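
For readers unfamiliar with the metric, here is a minimal sketch of how precision and recall at a 0.5 threshold can be computed. It is not the evaluation code from this repo: the function name `precision_recall`, the `(probability, is_true_link)` input format, and the choice of `>=` for the threshold comparison are all assumptions made for illustration.

```python
def precision_recall(scored, threshold=0.5):
    """Compute precision and recall for link predictions.

    `scored` is a hypothetical list of (probability, is_true_link) pairs;
    a candidate counts as predicted when probability >= threshold
    (the >= is an assumption, not taken from the repo).
    """
    tp = fp = fn = 0
    for prob, is_link in scored:
        predicted = prob >= threshold
        if predicted and is_link:
            tp += 1  # predicted link that really is a link
        elif predicted:
            fp += 1  # predicted link that is not a link
        elif is_link:
            fn += 1  # true link the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data, not taken from any wiki.
toy = [(0.9, True), (0.7, False), (0.4, True), (0.2, False), (0.6, True)]
precision, recall = precision_recall(toy)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Low recall with high precision, as for bowiki, means the few links the model does suggest are mostly right but most true links are never suggested.
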
Here are the previous results: https://meta.wikimedia.org/wiki/Research:Improving_multilingual_support_for_link_recommendation_model_for_add-a-link_task/Results_round-1
Results for bnwiki and simplewiki changed little, which is good since both are whitespace-delimited languages. Precision dropped sharply for mywiki and rose sharply for bowiki. However, recall for both mywiki (about 0.3%) and bowiki (about 8.6%) is still far below the 20% minimum we need.
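
As a quick check of the 20% requirement against the recall values reported above, the following sketch lists the wikis that fall short. The names `MIN_RECALL` and `reported_recall` are hypothetical and exist only for this illustration; the numbers are copied from the table in this description.

```python
MIN_RECALL = 0.20  # minimum recall stated in this description

# Recall at the 0.5 threshold, copied from the results table.
reported_recall = {
    "bnwiki": 0.2997409823484267,
    "bowiki": 0.08582089552238806,
    "mywiki": 0.002977533158891997,
    "simplewiki": 0.4305623471882641,
}

# Wikis whose recall is below the required minimum.
below_minimum = sorted(
    wiki for wiki, recall in reported_recall.items() if recall < MIN_RECALL
)
print(below_minimum)  # ['bowiki', 'mywiki']
```
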