Detect non-standard tables and lists
Detect Wikipedia article sections with non-standard tables and lists.
- parse a given Wikipedia Enterprise HTML dump
- extract `<table>` tags that come from templates
- output a JSON lines file of `{ revision_id, page_title, section_index, section_title }` objects
- extract section headers with the same level as in the pipeline
- skip tables with `presentation` ARIA roles, likely not tabular data
This script must run on a machine where the HTML dumps are mounted.
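
The steps above could be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions, not the script itself: it assumes each dump record is a JSON object with `name`, `version.identifier` and `article_body.html` fields, that sections are marked with `data-mw-section-id`, and that template output carries `typeof="mw:Transclusion"` on the `<table>` element itself. The names `TemplateTableFinder`, `flag_sections` and `to_json_lines` are hypothetical, and matching the pipeline's section header level is not modelled here.

```python
import json
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class TemplateTableFinder(HTMLParser):
    """Collect sections holding a template-generated, non-presentational <table>."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open <section> elements, innermost last
        self.flagged = []  # sections that contain a qualifying table
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "section":
            # Assumption: sections carry a numeric data-mw-section-id attribute.
            index = int(attrs.get("data-mw-section-id") or -1)
            self.stack.append({"section_index": index, "section_title": None})
        elif tag in HEADINGS and self.stack and self.stack[-1]["section_title"] is None:
            # The first heading inside a section is taken as its title.
            self.in_heading = True
            self.stack[-1]["section_title"] = ""
        elif tag == "table" and self.stack:
            # Assumption: template output is marked on the table element itself.
            from_template = "mw:Transclusion" in (attrs.get("typeof") or "")
            presentational = attrs.get("role") == "presentation"
            current = self.stack[-1]
            already_flagged = any(s is current for s in self.flagged)
            if from_template and not presentational and not already_flagged:
                self.flagged.append(current)

    def handle_data(self, data):
        if self.in_heading and self.stack:
            self.stack[-1]["section_title"] += data

    def handle_endtag(self, tag):
        if tag in HEADINGS and self.in_heading:
            self.in_heading = False
            if self.stack:
                self.stack[-1]["section_title"] = self.stack[-1]["section_title"].strip()
        elif tag == "section" and self.stack:
            self.stack.pop()


def flag_sections(record):
    """Return one output object per flagged section of a single dump record."""
    finder = TemplateTableFinder()
    finder.feed(record["article_body"]["html"])
    finder.close()
    return [
        {
            "revision_id": record["version"]["identifier"],
            "page_title": record["name"],
            **section,
        }
        for section in finder.flagged
    ]


def to_json_lines(records):
    """Yield one JSON line per flagged section across all records."""
    for record in records:
        for row in flag_sections(record):
            yield json.dumps(row, ensure_ascii=False)
```

In this sketch the lead section has no heading, so its `section_title` is `None`. Feeding `to_json_lines` with records decoded line by line from a dump file and writing each yielded string followed by a newline would give the JSON lines output described above.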