First version
This is SEAL's first complete version. It's made of 4 PySpark scripts:
- sections.py gathers relevant content from Wikipedia sections
- embeddings.py encodes Wikipedia section headings into embeddings with LaBSE
- features.py extracts features from Wikipedia sections to train a section alignment model
- model.py trains and evaluates a model, then predicts all (source, target) heading pairs
Caveat: there are no tests. They are unfortunately out of scope for now due to higher-priority work on Commons.
Note that GitLab CI jobs use mamba and no longer depend on workflow utils: the conda version there is too old and gets stuck in never-ending environment resolution.
Bug: T325316