Full rehaul of HTML dumps looking towards 1.0.0 release (!20) · Merge requests · repos / research / html-dumps

Isaac Johnson requested to merge full-rehaul into main Dec 20, 2023

Full rehaul of HTML dumps to standardize, add more flexibility to use outside of HTML dumps, and make plaintext generation more extendable. More detailed list of changes:

Miscellaneous/Packaging:

Update documentation
Separate code for generating constants as utils outside of core installed library
Structure tests so that running them requires installing an editable local version of the code, in line with best practices. The contribution guide is updated to reflect this, and all imports are switched from relative to absolute.
Remove unused constants

Dumps:

No longer require additional metadata from Enterprise dump to be present to process an article's HTML. (#48 (closed))
Dump no longer has a helper parameter for stopping after x articles. Callers can easily do this in a for loop, and the parameter complicated the code.
Create new Document object for Dumps that holds an article's HTML plus the additional metadata found in the HTML dumps
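
Stopping after x articles can now be handled on the caller's side. A minimal sketch, assuming only that the dump can be iterated over; `fake_dump` and `first_n` are hypothetical stand-ins for illustration, not part of the library:

```python
from itertools import islice


def fake_dump():
    """Hypothetical stand-in for an iterable dump; yields article titles forever."""
    n = 0
    while True:
        n += 1
        yield f"Article {n}"


def first_n(articles, limit):
    """Return at most `limit` items from any iterable without reading further."""
    return list(islice(articles, limit))


# Take three articles and stop, even though the source never ends.
sample = first_n(fake_dump(), 3)
```

A plain `for` loop with a counter and `break` works equally well; `islice` just keeps the limit out of the loop body.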

Functionality:

Transclusion check now also checks parents to see if they were transcluded which more accurately reflects transclusion status of an item. For example, now a wikilink within an infobox should also be marked as transcluded.
Split up media into separate image/audio/video functions for simplicity and consistency
Move is_ functions to be associated directly with their class objects and optimize them to run faster
Remove some unnecessary parameters in Elements. For example, we weren't doing anything special when calculating plaintext, so it doesn't need its own attribute. Not all elements have a title/tid, so those are removed from the base class. Transclusion is now best calculated via the is_transcluded util, but this is semi-expensive so I don't do it by default.
Simplify string representations for better readability / debugging
Add new Table (wikitable/infobox/sidebar/message box - #29, #53 (closed), #117 (closed), #116 (closed)) object, Hatnote (#118 (closed)), Section, Citation, and Comment classes.
Align get_<element> functions to all return a list of the Element class objects for consistency rather than some returning Elements, some returning strings, etc.
I removed the Template class and just left in a get_templates function. While things like Category, Reference, etc. make sense in the HTML because there are HTML tags that correspond, the Template doesn't really exist in the HTML, it's just a property of another tag (that it was transcluded). So I adjusted it so you could extract them but they weren't treated the same as the other HTML elements.
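
The parent-aware transclusion check described above can be sketched roughly as follows. This is an illustration of the idea only: the `Node` class and its `transcluded` flag are hypothetical stand-ins for whatever marker the HTML actually carries, and the function body is not the library's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for an HTML element; not the library's Element class."""
    name: str
    transcluded: bool = False  # whether this node itself carries a transclusion marker
    parent: Optional["Node"] = None


def is_transcluded(node: Node) -> bool:
    """Return True if the node or any of its ancestors is marked as transcluded."""
    current: Optional[Node] = node
    while current is not None:
        if current.transcluded:
            return True
        current = current.parent
    return False


# A wikilink inside a transcluded infobox inherits the transcluded status.
infobox = Node("table", transcluded=True)
link_in_infobox = Node("a", parent=infobox)
link_in_body = Node("a", parent=Node("p"))
```

Walking up the ancestor chain on every call is what makes the check semi-expensive, which fits the decision not to compute it by default.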

Plaintext:

Move plaintext functions to a separate file (to avoid some weird circular import issues).
Add basic plaintext tests
Plaintext now emits not just text but also transclusion status and the parent node types for that text so the user can decide which types to retain and which to skip -- e.g., skip categories + references + templated content.
Created helper function for getting first paragraph that is opinionated about what to exclude. This is mainly to show how one might adjust the more general plaintext function but also might be useful itself.
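
To illustrate how a caller might use the extra information that plaintext now emits, here is a rough sketch. The tuple shape `(text, transcluded, parent_types)`, the type names, and `join_plaintext` are assumptions made for this example, not the library's actual return format or API.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical shape of one emitted piece: (text, is_transcluded, parent node types).
Chunk = Tuple[str, bool, List[str]]


def join_plaintext(
    chunks: Iterable[Chunk],
    skip_types: Set[str],
    skip_transcluded: bool = True,
) -> str:
    """Concatenate text, dropping chunks under unwanted parents or from transclusions."""
    kept = []
    for text, transcluded, parent_types in chunks:
        if skip_transcluded and transcluded:
            continue  # templated content
        if skip_types.intersection(parent_types):
            continue  # at least one parent type is on the skip list
        kept.append(text)
    return "".join(kept)


chunks = [
    ("Paris is the capital of France.", False, ["Section", "Paragraph"]),
    ("[1]", False, ["Section", "Paragraph", "Reference"]),
    (" Population: 2.1 million", True, ["Section", "Table"]),
    ("Category:Capitals", False, ["Category"]),
]

# Skip categories, references, and templated content.
result = join_plaintext(chunks, skip_types={"Reference", "Category"})
```

An opinionated helper such as the first-paragraph one would amount to fixing a particular set of skip rules on top of this kind of general filter.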

Edited Jan 03, 2024 by Isaac Johnson
