
Full rehaul of HTML dumps looking towards 1.0.0 release

Isaac Johnson requested to merge full-rehaul into main

Full rehaul of the HTML dumps code to standardize it, make it more flexible to use outside of HTML dumps, and make plaintext generation more extensible. More detailed list of changes:


  • Update documentation
  • Move the code for generating constants into utils outside of the core installed library
  • Structure tests so that running them requires installing an editable local version of the code, in line with best practices. The contribution guide is updated to reflect this, and all imports are switched from relative to absolute
  • Remove unused constants


  • No longer require additional metadata from Enterprise dump to be present to process an article's HTML. (#48 (closed))
  • Dump no longer has a helper parameter for stopping after x articles. This can easily be done in a for loop, and the parameter complicated the code.
  • Create new Document object for Dumps that holds an article's HTML plus the additional metadata found in the HTML dumps
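
With the article-limit parameter gone, the caller can cap iteration themselves. A minimal sketch, assuming only that the dump object is iterable; `fake_dump` and `take` are hypothetical names used for illustration and are not part of the library:

```python
def fake_dump():
    """Hypothetical stand-in for a dump iterator; yields article-like dicts."""
    n = 0
    while True:
        n += 1
        yield {"title": f"Article {n}", "html": "<p>placeholder</p>"}


def take(articles, max_articles):
    """Collect at most max_articles items from any iterable using a plain loop."""
    taken = []
    if max_articles <= 0:
        return taken
    for article in articles:
        taken.append(article)
        if len(taken) >= max_articles:
            break  # stop here instead of relying on a parameter on the dump
    return taken


titles = [article["title"] for article in take(fake_dump(), 3)]
```

The same loop works unchanged on any iterable, which is why the limit does not need to live on the dump itself.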


  • Transclusion check now also checks an item's parents to see if they were transcluded, which more accurately reflects the item's transclusion status. For example, a wikilink within an infobox should now also be marked as transcluded.
  • Split up media into separate image/audio/video functions for simplicity and consistency
  • Move is_ functions to be associated directly with their class objects and optimize them to run faster
  • Remove some unnecessary parameters in Elements. For example, we weren't doing anything special when calculating plaintext, so that doesn't need its own attribute. Not all elements have title/tid, so those are removed from the base class. Transclusion is now best calculated via the is_transcluded util, but this is semi-expensive so I don't do it by default.
  • Simplify string representations for better readability / debugging
  • Add new Table (wikitable/infobox/sidebar/message box - #29, #53 (closed), #117 (closed), #116 (closed)) object, Hatnote (#118 (closed)), Section, Citation, and Comment classes.
  • Align get_<element> functions to all return a list of the Element class objects for consistency rather than some returning Elements, some returning strings, etc.
  • I removed the Template class and just left in a get_templates function. While things like Category, Reference, etc. make sense in the HTML because there are HTML tags that correspond to them, the Template doesn't really exist in the HTML; it's just a property of another tag (that it was transcluded). So I adjusted it so that you can still extract templates, but they aren't treated the same as the other HTML elements.
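
The parent-aware transclusion check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this merge request: `Node` is a hypothetical simplified tag, the signature of `is_transcluded` is a guess, and the "`about` attribute starting with `#mwt`" marker is an assumption about how transcluded output is flagged in the HTML.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a parsed HTML tag."""
    name: str
    attrs: dict = field(default_factory=dict)
    parent: Optional["Node"] = None


def directly_transcluded(node):
    # Assumption: transcluded output carries an `about` attribute like "#mwt1".
    return node.attrs.get("about", "").startswith("#mwt")


def is_transcluded(node):
    """True if the node itself or any of its ancestors is marked as transcluded."""
    current = node
    while current is not None:
        if directly_transcluded(current):
            return True
        current = current.parent  # walk up one level and check again
    return False


infobox = Node("table", {"class": "infobox", "about": "#mwt1"})
link_in_infobox = Node("a", {"rel": "mw:WikiLink"}, parent=infobox)
body = Node("section")
link_in_body = Node("a", {"rel": "mw:WikiLink"}, parent=body)
```

Here `link_in_infobox` has no marker of its own but is reported as transcluded because its parent is. The walk up the tree is also what makes the check relatively expensive on deeply nested pages.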


  • Move plaintext functions to a separate file (to avoid some circular import issues).
  • Add basic plaintext tests
  • Plaintext now emits not just text but also transclusion status and the parent node types for that text so the user can decide which types to retain and which to skip -- e.g., skip categories + references + templated content.
  • Created helper function for getting the first paragraph that is opinionated about what to exclude. This is mainly to show how one might adapt the more general plaintext function, but it might also be useful itself.
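
To illustrate how a caller might use the extra information emitted with the plaintext, here is a minimal sketch. It assumes each emitted item can be treated as a `(text, transcluded, parent_types)` tuple; that shape, the type names, and `filter_plaintext` are all hypothetical and may not match the actual output of the library.

```python
def fake_plaintext():
    """Hypothetical stream of (text, transcluded, parent_types) tuples."""
    yield ("Example City is a city.", False, ["Section", "Paragraph"])
    yield ("[1]", False, ["Section", "Paragraph", "Reference"])
    yield ("Population: unknown", True, ["Section", "Infobox"])
    yield ("Category:Example", False, ["Category"])


def filter_plaintext(chunks, exclude_types=frozenset(), exclude_transcluded=False):
    """Join only the text chunks that the caller wants to keep."""
    kept = []
    for text, transcluded, parent_types in chunks:
        if exclude_transcluded and transcluded:
            continue  # drop templated content
        if set(exclude_types).intersection(parent_types):
            continue  # drop text nested under an unwanted node type
        kept.append(text)
    return "".join(kept)


clean = filter_plaintext(
    fake_plaintext(),
    exclude_types={"Reference", "Category"},
    exclude_transcluded=True,
)
```

An opinionated helper such as the first-paragraph function can be thought of as this kind of filter with a fixed set of exclusions.
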
Edited by Isaac Johnson
