code for running batch prediction pipeline on full dump of all supported wikis

MGerlach requested to merge batch_predict_dumps into main

This contains code for running the batch prediction for all articles of a snapshot:

  • _preprocess: parse the dumps, extract the article content, and bring it into the required format. We take advantage of the structure of the dumps, which are split into chunks of around 100K articles; we keep this structure so we don't have to load all articles into memory when running the batch prediction (a sketch of this step follows the list).
  • _predict: run batch prediction with GPUs (this takes a few days)
  • _postprocess: combine the predictions of the individual chunks into a single table (a combined sketch of the prediction and post-processing steps also follows the list).
  • requirements.txt: the list of packages required for running the pipeline (e.g. mwxml to parse the dump files)
  • mbert_wikis.txt: list of wikis whose language is supported by the mBERT model
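
For illustration, a minimal sketch of the _preprocess step is shown below. The bz2-compressed input, the chunk/output file names, and the JSON-lines output format are assumptions made for this example rather than the pipeline's actual file layout; only the use of mwxml to iterate over the dump chunks comes from the description above.

```python
"""Sketch of _preprocess: parse one dump chunk with mwxml and write one
JSON line per article. File names and output format are placeholders."""
import bz2
import json

import mwxml  # listed in requirements.txt for parsing the dump files


def preprocess_chunk(dump_path: str, out_path: str) -> None:
    """Parse one ~100K-article dump chunk into the format used for prediction."""
    with bz2.open(dump_path, "rt") as f, open(out_path, "w") as out:
        dump = mwxml.Dump.from_file(f)
        for page in dump:
            # keep only main-namespace, non-redirect articles
            if page.namespace != 0 or page.redirect is not None:
                continue
            for revision in page:  # the articles dumps contain the latest revision only
                out.write(json.dumps({
                    "page_id": page.id,
                    "title": page.title,
                    "wikitext": revision.text or "",
                }) + "\n")


if __name__ == "__main__":
    # hypothetical file names, for illustration only
    preprocess_chunk("enwiki-latest-pages-articles1.xml.bz2", "chunk_0001.jsonl")
```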

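The second sketch covers the _predict and _postprocess steps together: batched GPU inference over the preprocessed chunks, then concatenation of the per-chunk results into one table. The model checkpoint (here the base bert-base-multilingual-cased), the label handling, and all paths are placeholders, since the actual fine-tuned mBERT model and output format are not specified in this description.

```python
"""Sketch of _predict + _postprocess under the assumptions above."""
import glob
import json

import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "bert-base-multilingual-cased"  # placeholder for the fine-tuned mBERT checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).to(device).eval()


def predict_chunk(chunk_path: str, batch_size: int = 64) -> pd.DataFrame:
    """Run batched inference over one preprocessed chunk (JSON lines)."""
    rows = [json.loads(line) for line in open(chunk_path)]
    records = []
    with torch.no_grad():
        for i in range(0, len(rows), batch_size):
            batch = rows[i : i + batch_size]
            enc = tokenizer(
                [r["wikitext"] for r in batch],
                truncation=True, padding=True, max_length=512, return_tensors="pt",
            ).to(device)
            preds = model(**enc).logits.argmax(dim=-1).tolist()
            records += [
                {"page_id": r["page_id"], "prediction": p}
                for r, p in zip(batch, preds)
            ]
    return pd.DataFrame(records)


# _postprocess: combine the per-chunk predictions into a single table
combined = pd.concat(
    [predict_chunk(p) for p in sorted(glob.glob("chunk_*.jsonl"))],
    ignore_index=True,
)
combined.to_csv("predictions_all_wikis.tsv", sep="\t", index=False)
```
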