Code for running the batch prediction pipeline on a full dump of all supported wikis
This contains the code for running batch prediction for all articles of a snapshot:
- _preprocess: parse the dumps, extract and parse the article content, and bring it into the required format. We take advantage of the fact that the dumps are split into chunks of around 100K articles and keep this structure so we don't have to load all articles into memory when running the batch prediction (see the preprocessing sketch below).
- _predict: run the batch prediction on GPUs; this takes a few days (see the prediction sketch below).
- _postprocess: combine the predictions of the individual chunks into a single table (see the postprocessing sketch below).
- requirements.txt: list of packages required for running the pipeline (e.g. mwxml for parsing the dump files).
- mbert_wikis.txt: list of wikis whose language is supported by the mBERT model.
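
The preprocessing step works roughly as sketched below: iterate over one dump chunk with mwxml and write out one record per article. The chunk file name, the `preprocess_chunk` helper, and the JSON-lines output fields are illustrative assumptions, not the pipeline's actual interface.

```python
import bz2
import json

import mwxml


def preprocess_chunk(chunk_path, out_path):
    """Extract (page_id, title, wikitext) for main-namespace articles of one dump chunk."""
    with bz2.open(chunk_path, "rb") as f, open(out_path, "w", encoding="utf-8") as out:
        dump = mwxml.Dump.from_file(f)
        for page in dump:
            # keep main-namespace articles only, skip redirects
            if page.namespace != 0 or page.redirect is not None:
                continue
            for revision in page:  # the articles dumps contain one (latest) revision per page
                if revision.text:
                    out.write(json.dumps({
                        "page_id": page.id,
                        "title": page.title,
                        "text": revision.text,
                    }) + "\n")


if __name__ == "__main__":
    # hypothetical chunk file following the multistream naming scheme
    preprocess_chunk(
        "enwiki-20240101-pages-articles-multistream1.xml-p1p41242.bz2",
        "enwiki-20240101-chunk1.jsonl",
    )
```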
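For the prediction step, a minimal sketch of GPU batch inference over one preprocessed chunk is shown below, assuming a Hugging Face sequence-classification checkpoint. The checkpoint name (`bert-base-multilingual-cased`), batch size, and truncation length are placeholders and not the model actually used by the pipeline.

```python
import json

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def predict_chunk(in_path, out_path,
                  model_name="bert-base-multilingual-cased",  # placeholder checkpoint
                  batch_size=64):
    """Run batched inference over one preprocessed chunk, writing one score vector per article."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device).eval()

    with open(in_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]

    with open(out_path, "w", encoding="utf-8") as out, torch.no_grad():
        for i in range(0, len(records), batch_size):
            batch = records[i:i + batch_size]
            enc = tokenizer([r["text"] for r in batch],
                            truncation=True, max_length=512,
                            padding=True, return_tensors="pt").to(device)
            probs = torch.softmax(model(**enc).logits, dim=-1).cpu()
            for rec, p in zip(batch, probs):
                out.write(json.dumps({"page_id": rec["page_id"],
                                      "title": rec["title"],
                                      "scores": p.tolist()}) + "\n")
```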
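The postprocessing step then only needs to concatenate the per-chunk outputs into a single table. A minimal sketch with pandas is below; the glob pattern and the Parquet output format are assumptions.

```python
import glob

import pandas as pd


def combine_predictions(pattern="predictions/*.jsonl",
                        out_path="predictions_all.parquet"):
    """Concatenate the per-chunk prediction files into one table."""
    frames = [pd.read_json(path, lines=True) for path in sorted(glob.glob(pattern))]
    table = pd.concat(frames, ignore_index=True)
    table.to_parquet(out_path, index=False)  # writing Parquet requires pyarrow (or fastparquet)
    return table
```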