Just to remind myself: we still need the encoder to capture wiki_db as a feature for the model, right? I remember that one issue with it was applying the model to a wiki that was not part of the training data (though I think the problem there was that the variable was then not defined).
MGerlach (20b1ecfe) at 25 Mar 09:14
notebook to count nocookie requests
MGerlach (202a2d96) at 20 Mar 08:45
adding notebooks
this will also count the spaces and commas, right?
This MR
Current baseline for language dependent models (w/ outlink embedding):
wiki | precision | recall |
---|---|---|
arwiki | 0.808 | 0.340 |
bnwiki | 0.721 | 0.349 |
bowiki | 0.982 | 0.613 |
cswiki | 0.785 | 0.437 |
dewiki | 0.816 | 0.453 |
dzwiki | 1.0 | 0.1 |
ganwiki | 0.843 | 0.300 |
piwiki | nan | 0.0 |
ptwiki | 0.835 | 0.433 |
simplewiki | 0.783 | 0.398 |
viwiki | 0.875 | 0.572 |
MGerlach (6eda5ea0) at 26 Feb 11:15
Merge branch 'new-baseline' into 'language-agnostic-main'
... and 1 more commit
could be a separate MR
MGerlach (33d86f5d) at 23 Feb 09:35
initial commit
MGerlach (7b17b1ea) at 01 Feb 09:10
readme for setting up venv
MGerlach (4b3aac0f) at 01 Feb 09:04
initial commit
Fixed the regex that turns a detected mention in the text into a link. Previously it matched mentions using word boundaries (`\b`), which implicitly relies on whitespace between words. This does not work for languages that are written without whitespace between words. The regex now matches the mention as a plain substring. This improved recall for most of the languages that were previously failing and did not degrade the performance of the other models. Some example languages were run to confirm consistent performance.
Recall improved for 14 of the 22 languages; the rest had similar results. There was no significant drop in performance.
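To illustrate the difference between the two matching strategies, here is a minimal sketch. It is not the code from this MR: the function names `find_with_word_boundaries` and `find_as_substring` and the sample sentences are hypothetical, and the actual regex in the link-recommendation code may differ.

```python
import re


def find_with_word_boundaries(mention: str, text: str) -> bool:
    """Match the mention only where it is delimited by word boundaries."""
    pattern = r"\b" + re.escape(mention) + r"\b"
    return re.search(pattern, text) is not None


def find_as_substring(mention: str, text: str) -> bool:
    """Match the mention anywhere in the text, ignoring word boundaries."""
    return re.search(re.escape(mention), text) is not None


# Hypothetical sample sentences (not taken from the evaluation data).
text_zh = "我住在北京市"          # no whitespace between words
text_en = "I live in Berlin now"  # whitespace-delimited words

# In the Chinese sentence the neighbouring characters are word characters
# too, so there is no \b position around the mention and the match fails.
print(find_with_word_boundaries("北京", text_zh))
print(find_as_substring("北京", text_zh))

# In the English sentence both strategies find the mention.
print(find_with_word_boundaries("Berlin", text_en))
print(find_as_substring("Berlin", text_en))
```

One trade-off of substring matching is that it can also match inside a longer word (for example, "art" inside "party"), which may be related to the slight precision drops reported for some wikis below, although that was not verified here.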
Languages that were previously failing (previous = the state of the link-recommendation as of last commit)
wiki | previous precision | precision | previous recall | recall | comments |
---|---|---|---|---|---|
aswiki | 0.67 | 0.68 | 0.17 | 0.28 | recall improvement |
bowiki | 0.90 | 0.98 | 0.07 | 0.62 | recall improvement |
diqwiki | 0.92 | 0.88 | 0.35 | 0.49 | recall improvement, slight drop in precision |
dvwiki | 1.0 | 0.88 | 0.02 | 0.49 | recall improvement, slight drop in precision |
dzwiki | 1.0 | 1.0 | 0.07 | 0.23 | recall improvement |
fywiki | 0.82 | 0.82 | 0.45 | 0.459 | similar results |
ganwiki | 0.88 | 0.82 | 0.07 | 0.296 | recall improvement |
hywwiki | 0.78 | 0.75 | 0.20 | 0.30 | recall improvement |
jawiki | 0.85 | 0.82 | 0.06 | 0.35 | recall improvement |
krcwiki | 0.77 | 0.78 | 0.33 | 0.35 | similar results |
mnwwiki | 1.0 | 0.97 | 0.02 | 0.68 | recall improvement |
mywiki | 0.70 | 0.95 | 0.047 | 0.82 | recall improvement |
piwiki | 0 | 0 | nan | nan | only 13 sentences |
shnwiki | 0.99 | 0.99 | 0.77 | 0.88 | recall improvement |
snwiki | 0.67 | 0.69 | 0.16 | 0.18 | similar results |
szywiki | 0.69 | 0.79 | 0.23 | 0.48 | improvement |
tiwiki | 0.796 | 0.796 | 0.48 | 0.48 | similar results |
urwiki | 0.86 | 0.86 | 0.53 | 0.54 | similar results |
wuuwiki | 0.42 | 0.68 | 0.007 | 0.36 | improvement |
zhwiki | 0.82 | 0.78 | 0.04 | 0.47 | improvement |
zh_classicalwiki | 1.0 | 1.0 | 0.0001 | 0.0001 | no improvement |
zh_yuewiki | 0.31 | 0.31 | 0.0006 | 0.0006 | no improvement |
Some other languages.
wiki | previous precision | precision | previous recall | recall | comments |
---|---|---|---|---|---|
arwiki | 0.82 | 0.82 | 0.35 | 0.36 | similar |
bnwiki | 0.734 | 0.725 | 0.295 | 0.38 | similar |
cswiki | 0.80 | 0.80 | 0.45 | 0.45 | similar |
dewiki | 0.83 | 0.83 | 0.48 | 0.48 | similar |
frwiki | 0.82 | 0.82 | 0.50 | 0.50 | similar |
simplewiki | 0.79 | 0.79 | 0.43 | 0.43 | similar |
viwiki | 0.91 | 0.91 | 0.67 | 0.67 | similar |
`zhwiki` and `fywiki` Unicode errors were resolved by using `wikipedia2vec==2.0.0`, but this gave rise to an `IndexError` for several other languages that had previously run successfully. To make the script work for all languages, the following changes were made:
All models now run successfully.