Determine appropriate level of processing on instantiation of Article object
When an Article object is instantiated, several levels of processing are possible. We should choose the one that best balances functionality against overhead:
- Minimal: just create an Article object with the HTML raw string (unparsed) and a few attributes possibly based on the dump metadata. No parsing of HTML.
- Middle: creating the Article object runs the basic bs4 parsing of the HTML string into a DOM. This is the current behavior. It costs more time and memory, but may be worth it if it does not slow down iteration too much and exposes basic metadata from the HTML that is useful.
- High: fully parse the HTML into a DOM and also extract wiki-specific features. Likely too much overhead to be the default, but could be offered as an option.
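
One way the three levels could coexist is sketched below. This is only an illustration under assumptions, not the current implementation: `ProcessingLevel`, the `level` and `parser` parameters, the `soup` and `features` attributes, and the `num_links` feature are all hypothetical names. The idea is that the level only controls when work happens. Parsing is cached and deferred, so a minimal Article can still be parsed later on first access.

```python
from enum import Enum
from functools import cached_property


class ProcessingLevel(Enum):
    # Hypothetical names for the three levels discussed above.
    MINIMAL = "minimal"  # keep the raw HTML string only
    MIDDLE = "middle"    # parse the HTML into a DOM at instantiation
    HIGH = "high"        # parse and also extract wiki-specific features


def _parse_with_bs4(html):
    # Imported lazily so the minimal level never has to load bs4.
    from bs4 import BeautifulSoup
    return BeautifulSoup(html, "html.parser")


class Article:
    def __init__(self, html, metadata=None,
                 level=ProcessingLevel.MINIMAL, parser=_parse_with_bs4):
        self.html = html                      # raw, unparsed HTML string
        self.metadata = dict(metadata or {})  # e.g. fields from the dump
        self.level = level
        self._parser = parser
        if level is not ProcessingLevel.MINIMAL:
            self.soup       # eager parse for MIDDLE and HIGH
        if level is ProcessingLevel.HIGH:
            self.features   # eager feature extraction for HIGH

    @cached_property
    def soup(self):
        # Parsed at most once, on first access.
        return self._parser(self.html)

    @cached_property
    def features(self):
        # Placeholder for wiki-specific extraction; the link count
        # is an assumed example feature only.
        return {"num_links": len(self.soup.find_all("a"))}
```

With this shape, iterating over a dump at the minimal level pays no parsing cost, and code that needs the DOM for a particular article just reads `article.soup`. Whether the default should be minimal or middle would still need to be decided by measuring iteration time and memory on a real dump.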