Detect non-standard tables and lists (!21) · Merge requests · repos / structured-data / Section Topics · GitLab

Marco Fossati requested to merge html-dumps into main Mar 15, 2023

Detect Wikipedia article sections with non-standard tables and lists.

parse a given Wikipedia Enterprise HTML dump
extract <table> tags that come from templates
output a JSON lines file of { revision_id, page_title, section_index, section_title } objects
extract section headers with the same level as in the pipeline
skip tables with the ARIA presentation role, since they are likely used for layout and not tabular data
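
The steps above could be sketched roughly as follows. This is a minimal illustration, not the code in this merge request. It assumes each dump line is a JSON object with `name`, `version.identifier`, and `article_body.html` fields, that template-generated tables carry `typeof="mw:Transclusion"`, and that sections are counted at `h2` level (the level used by the pipeline is not stated here). `TemplateTableFinder` and `detect` are hypothetical names.

```python
import json
from html.parser import HTMLParser


class TemplateTableFinder(HTMLParser):
    """Record the sections of one article that hold template-generated tables."""

    def __init__(self, heading_tag="h2"):  # assumption: sections split at h2
        super().__init__()
        self.heading_tag = heading_tag
        self.section_index = 0    # 0 stands for the lead section
        self.section_title = ""   # the lead section has no title
        self._in_heading = False
        self.hits = []            # ordered (section_index, section_title) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == self.heading_tag:
            # A new heading opens a new section
            self.section_index += 1
            self.section_title = ""
            self._in_heading = True
        elif tag == "table":
            # Assumption: tables coming from templates are marked this way
            from_template = "mw:Transclusion" in (attrs.get("typeof") or "").split()
            presentational = (attrs.get("role") or "").lower() == "presentation"
            hit = (self.section_index, self.section_title.strip())
            if from_template and not presentational and hit not in self.hits:
                self.hits.append(hit)

    def handle_data(self, data):
        if self._in_heading:
            self.section_title += data

    def handle_endtag(self, tag):
        if tag == self.heading_tag:
            self._in_heading = False


def detect(dump_lines):
    """Yield one output object per section with a template-generated table."""
    for line in dump_lines:
        article = json.loads(line)
        finder = TemplateTableFinder()
        finder.feed(article["article_body"]["html"])
        for index, title in finder.hits:
            yield {
                "revision_id": article["version"]["identifier"],
                "page_title": article["name"],
                "section_index": index,
                "section_title": title,
            }
```

Each yielded object can be written with `json.dumps` on its own line to produce the JSON lines output.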

This script must run on a machine where the HTML dumps are mounted.