Fixing addalink pipeline
This fixes several bugs in the addalink pipeline:
- backtesting evaluation: Table with results was empty as all sentences from testing data were skipped. We remove the check that output from `process_page` should be `str` (it is `mwparserfromhell.wikicode.Wikicode`) in `generate_backtesting_eval`.
- embeddings snapshots: When using the topic embeddings, we did not select a snapshot (there are 3 snapshots stored). Since we join that table twice, we ended up with 9 duplicate rows in `link_train.parquet`. We now specify `snapshot` and `wiki_dbs` in `generate_training_data` to get `embeddings_df`. We also store the embeddings in a directory `/embeddings` so that the same embeddings are available for inference.
- embeddings distances: Almost all distances (w2v) were close to 1. This was caused by a wrong join when getting the embeddings of the link target. We now do the correct join in `generate_training_data`.
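
To illustrate the snapshot issue described in the list above, here is a minimal sketch in plain Python (not the pipeline's actual code). It assumes a toy embeddings table with one row per page and snapshot; the snapshot labels, the `inner_join` helper, and the column names `page_id`, `source_id`, and `target_id` are hypothetical. It shows why joining an unfiltered table with 3 snapshots twice yields 3 x 3 = 9 rows for a single link, and why selecting one snapshot first avoids this.

```python
from itertools import product

# Hypothetical toy data: one embedding row per (wiki_db, page, snapshot).
SNAPSHOTS = ["2024-01", "2024-02", "2024-03"]  # made-up snapshot labels
embeddings = [
    {"wiki_db": "simplewiki", "page_id": pid, "snapshot": snap}
    for pid, snap in product([1, 2], SNAPSHOTS)
]

# One candidate link: source page 1 -> target page 2.
links = [{"wiki_db": "simplewiki", "source_id": 1, "target_id": 2}]


def inner_join(left, right, left_key, right_key, prefix):
    """Naive inner join on wiki_db plus one page-id key column."""
    out = []
    for l in left:
        for r in right:
            if l["wiki_db"] == r["wiki_db"] and l[left_key] == r[right_key]:
                out.append({**l, prefix + "snapshot": r["snapshot"]})
    return out


def join_source_and_target(links, embeddings):
    """Join the embeddings table twice: once for the source, once for the target."""
    with_source = inner_join(links, embeddings, "source_id", "page_id", "source_")
    return inner_join(with_source, embeddings, "target_id", "page_id", "target_")


# Without a snapshot filter: 3 source rows x 3 target rows = 9 rows for one link.
unfiltered = join_source_and_target(links, embeddings)

# With a single snapshot selected up front: exactly one row per link.
selected = [e for e in embeddings if e["snapshot"] == "2024-03"]
filtered = join_source_and_target(links, selected)

print(len(unfiltered), len(filtered))
```

The same reasoning applies to any table joined more than once: the row multiplication grows with the number of joins, so filtering on the snapshot (and the wiki) before joining is the cheaper place to fix it.
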
The pipeline was successfully run with simplewiki and enwiki. The backtesting results (especially precision) are similar to or better than those of the previously trained models.