Fixing addalink pipeline

MGerlach requested to merge addalink into main

This fixes several bugs in the addalink pipeline:

  • backtesting evaluation: The results table was empty because all sentences from the testing data were skipped. In generate_backtesting_eval, we remove the check that the output of process_page is a str (it is a mwparserfromhell.wikicode.Wikicode).
  • embeddings snapshots: When using the topic embeddings, we did not select a snapshot (there are 3 snapshots stored). Since we join that table twice, we ended up with 9 duplicate rows (3 × 3) in link_train.parquet. We now specify snapshot and wiki_dbs in generate_training_data to get embeddings_df. We also store the embeddings in a directory /embeddings so that the same embeddings are available for inference.
  • embeddings distances: Almost all distances (w2v) were close to 1. This was caused by an incorrect join when fetching the embeddings of the link target. We now do the correct join in generate_training_data.
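
To illustrate the embeddings snapshots bug, here is a minimal sketch in plain Python rather than the pipeline's actual code. The toy rows, the snapshot labels, and the helper names join_embeddings and select_snapshot are hypothetical; only snapshot, wiki_dbs, and the 3 × 3 duplication come from the description above.

```python
from itertools import product

# Hypothetical toy table: one embedding per page per snapshot (3 snapshots stored).
embeddings = [
    {"snapshot": snap, "wiki_db": "simplewiki", "page_id": page_id, "vec": (0.1, 0.2)}
    for snap in ("snap-a", "snap-b", "snap-c")
    for page_id in (1, 2)
]

# One candidate link from source page 1 to target page 2.
links = [{"wiki_db": "simplewiki", "source_id": 1, "target_id": 2}]


def join_embeddings(links, embeddings):
    """Inner-join links to embeddings twice: once for the source, once for the target."""
    rows = []
    for link in links:
        src = [e for e in embeddings
               if e["wiki_db"] == link["wiki_db"] and e["page_id"] == link["source_id"]]
        dst = [e for e in embeddings
               if e["wiki_db"] == link["wiki_db"] and e["page_id"] == link["target_id"]]
        # Every matching source row pairs with every matching target row.
        for s_row, t_row in product(src, dst):
            rows.append({**link, "source_vec": s_row["vec"], "target_vec": t_row["vec"]})
    return rows


def select_snapshot(embeddings, snapshot, wiki_dbs):
    """Keep a single snapshot and the requested wikis before joining."""
    return [e for e in embeddings
            if e["snapshot"] == snapshot and e["wiki_db"] in wiki_dbs]


unfiltered = join_embeddings(links, embeddings)
filtered = join_embeddings(links, select_snapshot(embeddings, "snap-c", {"simplewiki"}))
print(len(unfiltered), len(filtered))  # 9 1
```

Filtering to one snapshot before the two joins keeps one row per link, and it also pins down which embeddings need to be stored for inference.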

The pipeline was successfully run with simplewiki and enwiki. The backtesting results (especially precision) are similar to or better than those of the previously trained models.
