First version
This is SEAL's first complete version. It's made of 4 PySpark scripts:
- sections.py
  gathers relevant content from Wikipedia sections
- embeddings.py
  encodes Wikipedia section headings into embeddings with LaBSE
- features.py
  extracts features from Wikipedia sections to train a section alignment model
- model.py
  trains and evaluates a model, then predicts all (source, target) heading pairs
Caveat: there are no tests yet. They are out of scope for now due to higher-priority work on Commons.
Note that GitLab CI jobs use mamba and no longer depend on workflow utils: the conda version there is too old and gets stuck in never-ending environment resolution.
Bug: T325316