First version (!1) · Merge requests · repos / structured-data / seal

Marco Fossati requested to merge dev into main Oct 03, 2023

This is SEAL's first complete version. It's made of 4 PySpark scripts:

sections.py gathers relevant content from Wikipedia sections
embeddings.py encodes Wikipedia section heading embeddings with LaBSE
features.py extracts features from Wikipedia sections to train a section alignment model
model.py trains and evaluates a model, then predicts all (source, target) heading pairs
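
For reviewers unfamiliar with embedding-based alignment, here is a minimal sketch of one plausible way to score (source, target) heading pairs from their embeddings. It is an illustration under assumptions, not the code in this MR: the cosine-similarity scoring, the function names `cosine_similarity` and `rank_targets`, and the 3-dimensional toy vectors are all hypothetical. Real LaBSE embeddings are much larger, and this description does not say which features or model SEAL actually uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_targets(source_vec, target_vecs):
    """Return (target heading, score) pairs sorted by descending similarity."""
    scored = [
        (heading, cosine_similarity(source_vec, vec))
        for heading, vec in target_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for heading embeddings (hypothetical values,
# not real LaBSE output).
source = {"History": [0.9, 0.1, 0.0]}
targets = {
    "Storia": [0.8, 0.2, 0.1],
    "Geografia": [0.1, 0.9, 0.2],
}

ranking = rank_targets(source["History"], targets)
print(ranking[0][0])  # best-scoring target heading for "History"
```

In a PySpark job the same scoring would more likely run as a column expression or UDF over a DataFrame of heading pairs; plain Python is used here only to keep the sketch self-contained.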

Caveat: there are no tests. Unfortunately, they are out of scope for now due to higher-priority work on Commons.

Note that GitLab CI jobs use mamba and no longer depend on workflow utils: the conda version there is too old and gets stuck in a never-ending environment resolution.

Bug: T325316

Edited Oct 03, 2023 by Marco Fossati
