
First version

Marco Fossati requested to merge dev into main

This is SEAL's first complete version. It consists of 4 PySpark scripts:

  1. sections.py gathers relevant content from Wikipedia sections
  2. embeddings.py encodes Wikipedia section headings into embeddings with LaBSE
  3. features.py extracts features from Wikipedia sections to train a section alignment model
  4. model.py trains & evaluates a model, then predicts all (source, target) heading pairs
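
To give reviewers a feel for what scoring (source, target) heading pairs means, here is a minimal plain-Python sketch. It is not the code in this merge request: the toy 3-dimensional vectors stand in for LaBSE embeddings, the function names `cosine`, `score_pairs` and `best_alignment` are hypothetical, and ranking by cosine similarity replaces the trained model purely for illustration.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_pairs(source, target):
    """Score every (source, target) heading pair by embedding similarity."""
    return {
        (s, t): cosine(s_vec, t_vec)
        for (s, s_vec), (t, t_vec) in product(source.items(), target.items())
    }


def best_alignment(scores):
    """Keep the highest-scoring target heading for each source heading."""
    best = {}
    for (s, t), score in scores.items():
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best


# Hypothetical toy vectors, NOT real LaBSE output (which has many more dimensions).
source_headings = {"History": [1.0, 0.0, 0.0], "Geography": [0.0, 1.0, 0.0]}
target_headings = {"Storia": [0.9, 0.1, 0.0], "Geografia": [0.1, 0.9, 0.0]}

scores = score_pairs(source_headings, target_headings)
alignment = best_alignment(scores)
for source, (target, score) in alignment.items():
    print(f"{source} -> {target} ({score:.3f})")
```

In the actual pipeline the pairs are scored by the model trained in model.py on the features from features.py, and everything runs on PySpark DataFrames rather than in-memory dictionaries.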

Caveat: there are no tests yet. Unfortunately, they are out of scope for now due to higher-priority work on Commons.

Note that GitLab CI jobs now use mamba and no longer depend on workflow utils: the conda version there is too old and gets stuck in a never-ending environment resolution.

Bug: T325316

Edited by Marco Fossati
