Detect non-standard tables and lists

Marco Fossati requested to merge html-dumps into main

Detect Wikipedia article sections with non-standard tables and lists.

  • parse a given Wikimedia Enterprise HTML dump
  • extract <table> tags that are generated by templates
  • output a JSON Lines file of { revision_id, page_title, section_index, section_title } objects
  • extract section headers at the same heading level as the pipeline uses
  • skip tables with the ARIA role "presentation", which likely do not hold tabular data
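
The steps above can be sketched roughly as follows. This is a minimal illustration, not the script in this merge request: the names `TemplateTableFinder`, `detect_sections`, and `HEADING_TAG` are hypothetical, and it assumes that template-generated tables carry a `typeof` attribute containing `mw:Transclusion` and that sections are delimited by level-2 headings. It works on one article's HTML string and leaves reading the dump itself out.

```python
import json
from html.parser import HTMLParser

# Assumption: the pipeline splits articles on level-2 headings.
HEADING_TAG = "h2"


class TemplateTableFinder(HTMLParser):
    """Record the sections that contain template-generated data tables."""

    def __init__(self):
        super().__init__()
        self.section_index = 0   # 0 is the lead section, before any heading
        self.section_title = ""
        self.in_heading = False
        self.hits = []           # (section_index, section_title) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == HEADING_TAG:
            self.section_index += 1
            self.section_title = ""
            self.in_heading = True
        elif tag == "table":
            # Assumption: this marker identifies template output.
            from_template = "mw:Transclusion" in (attrs.get("typeof") or "")
            # Presentational tables are layout helpers, so skip them.
            presentational = (attrs.get("role") or "").lower() == "presentation"
            if from_template and not presentational:
                hit = (self.section_index, self.section_title)
                if hit not in self.hits:
                    self.hits.append(hit)

    def handle_endtag(self, tag):
        if tag == HEADING_TAG:
            self.in_heading = False
            self.section_title = self.section_title.strip()

    def handle_data(self, data):
        if self.in_heading:
            self.section_title += data


def detect_sections(revision_id, page_title, html):
    """Return one output object per section with a non-standard table."""
    finder = TemplateTableFinder()
    finder.feed(html)
    finder.close()
    return [
        {
            "revision_id": revision_id,
            "page_title": page_title,
            "section_index": index,
            "section_title": title,
        }
        for index, title in finder.hits
    ]


if __name__ == "__main__":
    sample = (
        "<h2>History</h2>"
        '<table typeof="mw:Transclusion"><tr><td>x</td></tr></table>'
    )
    # One JSON object per line, as in a JSON Lines file.
    for row in detect_sections(1, "Sample page", sample):
        print(json.dumps(row))
```

Each section is reported at most once, however many matching tables it contains, which matches the per-section shape of the output objects.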

This script must run on a machine where the HTML dumps are mounted.
