html-dumps merge requests
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests

!32 Support using an existing file obj (Matthias Mullie, 2024-02-26)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/32
This provides more flexibility; e.g., the dump no longer needs to be a local file.
An example use case would be running within a Spark job,
where there is no local file: there's no access to the NFS
share and not enough disk space to download full dumps to,
but I can have the dumps in HDFS, and access them from
within the job like so:
```python
import subprocess

cat = subprocess.Popen(['hdfs', 'dfs', '-cat', wiki_dump_path], stdout=subprocess.PIPE)
html_dump = HTMLDump(filepath=wiki_dump_path, fileobj=cat.stdout)
```

!31 Fix citation detection bug (Isaac Johnson, 2024-02-14)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/31
Fix citation detection so it doesn't try to make separate citations out of page-number suffixes on citations (which throw an error because they have no element id). See #129.

!30 Adjust Media class to be more robust and capture most captions (Isaac Johnson, 2024-02-12)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/30
Adjust the Media class so it should usually contain the caption too (exceptions are infoboxes/galleries, but those can be covered by the infobox/list handling in the plaintext functions). Make media metadata extraction more robust as well.
Relevant specs:
* https://www.mediawiki.org/wiki/Specs/HTML/2.8.0/Extensions/Gallery
* https://www.mediawiki.org/wiki/Specs/HTML/2.8.0#Media

!29 Fix bug with duplicate section text in plaintext and bump minor version (Isaac Johnson, 2024-02-09)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/29

!28 bump version and expose wikistew (Isaac Johnson, 2024-02-06)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/28

!27 Formatting bug (Isaac Johnson, 2024-02-06)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/27
Fix a formatting bug that was preventing plaintext from detecting citations and other nodes that were also text-formatting nodes. Also update the README.

!26 Add text-formatting element and reformat stew. (Isaac Johnson, 2024-02-05)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/26
The only entry point into the wiki extraction methods assumed
that the full article was present, which meant you had
a bunch of methods (get_title etc.) that would return
errors if you for instance wanted to extract text-formatting
from just a single section. So now the Article class has
full-HTML-level functions, while the WikiStew class has the wiki
extractors and can be used to wrap any subsection of HTML.
In the future, I'll try to make it easier to extract just a
single section without parsing the full document first via
the SoupStrainer functionality of bs4.

!25 Fallback table for plaintext (Isaac Johnson, 2024-02-02)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/25
Make table a fallback type in plaintext for table elements that don't match a standard type. This covers things like sports boxes, which use tables for layout but are neither wikitables nor other recognized tables like infoboxes. Not a full element, but it allows for filtering in plaintext.

!24 Add support for list detection (Isaac Johnson, 2024-02-02)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/24

!23 Add in marker if whole paragraph is transcluded. (Isaac Johnson, 2024-02-02)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/23
Split plaintext generator into paragraphs (not just sections).
If all elements in a paragraph are transcluded, give the option to skip it. This is because, e.g., coordinates templates generate text that is technically in a <p> node but that many would want to skip; in particular, it appears first in an article, so it could be mistaken for the article lede, and the nature of the text (harder to filter on) and the fact that it is fully transcluded are the only signals.

!22 Update tests (Isaac Johnson, 2024-02-01)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/22
* Add another test article and a few more tests
* Only test multiple articles on a given feature if they clearly differ

!21 Adjust elements and plaintext (Isaac Johnson, 2024-01-24)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/21
* Add Math element
* Change Table/Hatnote to more specific Infobox, Wikitable, Navigation, Messagebox, Note
* Update plaintext to not track transclusion (poor proxy)
* Update plaintext to instead track paragraph relation and the new elements (better proxy)
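The element-tracking idea in the bullets above can be illustrated with a small self-contained sketch. The tuple shape and element-type names here are illustrative assumptions, not the library's actual API:

```python
# Each plaintext unit carries the element types it sits inside, so callers
# can decide which text to keep (e.g., drop infobox and navigation text).
units = [
    ("Paris is the capital of France.", {"Section", "Paragraph"}),
    ("Capital: Paris", {"Section", "Infobox"}),
    ("v t e France topics", {"Section", "Navigation"}),
]

# Element types whose text we choose to drop (hypothetical names).
EXCLUDE = {"Infobox", "Wikitable", "Navigation", "Messagebox", "Note"}

# Keep only units that share no element type with the exclusion set.
kept = [text for text, types in units if not (types & EXCLUDE)]
print(kept)  # ['Paris is the capital of France.']
```

This is a better proxy than transclusion status alone because it filters on where the text lives in the document rather than on how it was generated.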
Addresses #121 (reorg of classes), #116 (now correctly extracting navboxes), #43 (ignore transclusion -- use other, better indicators), and #42 (split by sections; paragraphs can easily be split by `\n` as shown in `get_first_paragraph`).

!20 Full rehaul of HTML dumps looking towards 1.0.0 release (Isaac Johnson, 2024-01-04)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/20
Full rehaul of HTML dumps to standardize, add more flexibility for use outside of HTML dumps, and make plaintext generation more extendable. More detailed list of changes:
Miscellaneous/Packaging:
* Update documentation
* Separate code for generating constants as utils outside of core installed library
* Structure tests so that running them requires installing an editable local version of the code, to align with best practices. The contribution guide was updated to reflect this, and all imports were switched from relative to absolute.
* Remove unused constants
Dumps:
* No longer require additional metadata from Enterprise dump to be present to process an article's HTML. (#48)
* Dump no longer has helper parameter for stopping after x articles. This can easily be done in a for loop and complicates the code.
* Create new Document object for Dumps that holds an article's HTML plus the additional metadata found in the HTML dumps
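The "just do it in a for loop" point above can be shown with `itertools.islice`; `dump` here is a stand-in iterable, not the real dump object, which we assume can be iterated the same way:

```python
from itertools import islice

# Stand-in for an iterable dump of articles (assumption for illustration).
dump = iter(range(1000))

# Stop after the first five items with plain iteration,
# instead of a dedicated "stop after x articles" helper parameter.
first_five = list(islice(dump, 5))
print(first_five)  # [0, 1, 2, 3, 4]
```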
Functionality:
* Transclusion check now also checks parents to see if they were transcluded which more accurately reflects transclusion status of an item. For example, now a wikilink within an infobox should also be marked as transcluded.
* Split up media into separate image/audio/video functions for simplicity and consistency
* Move is_<nodetype> functions to be associated directly with their class objects and optimize them to run faster
* Remove some unnecessary parameters in Elements. For example, we weren't doing anything special with calculating plaintext, so that doesn't need its own attribute. Not all elements have title/tid, so those were removed from the base class. Transclusion is now best calculated via the is_transcluded util, but this is semi-expensive, so I don't do it by default.
* Simplify string representations for better readability / debugging
* Add new Table (wikitable/infobox/sidebar/message box - #29, #53, #117, #116) object, Hatnote (#118), Section, Citation, and Comment classes.
* Align `get_<element>` functions to all return a list of the Element class objects for consistency rather than some returning Elements, some returning strings, etc.
* I removed the Template class and just left in a get_templates function. While things like Category, Reference, etc. make sense in the HTML because there are HTML tags that correspond, the Template doesn't really exist in the HTML, it's just a property of another tag (that it was transcluded). So I adjusted it so you could extract them but they weren't treated the same as the other HTML elements.
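The ancestor-checking rule in the transclusion bullet above can be sketched with a toy node structure. This is a hedged illustration only; the real implementation operates on parsed HTML nodes, not this hypothetical `Node` class:

```python
class Node:
    """Toy tree node: a name, a transclusion flag, and a parent pointer."""
    def __init__(self, name, transcluded=False, parent=None):
        self.name = name
        self.transcluded = transcluded
        self.parent = parent

def is_transcluded(node):
    """A node counts as transcluded if it or any ancestor was transcluded."""
    while node is not None:
        if node.transcluded:
            return True
        node = node.parent
    return False

infobox = Node("infobox", transcluded=True)
wikilink = Node("wikilink", parent=infobox)  # a link inside a transcluded infobox
print(is_transcluded(wikilink))  # True
```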
Plaintext:
* Move plaintext functions to a separate file (to avoid some circular-import issues).
* Add basic plaintext tests
* Plaintext now emits not just text but also transclusion status and the parent node types for that text so the user can decide which types to retain and which to skip -- e.g., skip categories + references + templated content.
* Created a helper function for getting the first paragraph that is opinionated about what to exclude. This is mainly to show how one might adjust the more general plaintext function, but it might also be useful itself.

!19 Expose real page title from Article (Matthias Mullie, 2023-06-20)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/19
Fixes #52

!18 Don't fail to iterate dumps (Matthias Mullie, 2023-06-22)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/18
This essentially simply skips articles that could not be initialized
(e.g. malformed data) rather than causing complete failure that
consumers can't recover from.
Fixes #119

!17 Fix flake8, simplify to just Python3, clean-up contribution guide (Isaac Johnson, 2022-12-19)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/17
Closes #51 and does a few other small tweaks to the contribution guide / pre-commit.

!16 feature: metadata extraction (Appledora, 2022-08-23)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/16
Extracts all the metadata from the dump json, except for the `["article_body", "url", "namespace", "name", "in_language"]` keys.
Closes #41

!15 Initial template for packaging (Isaac Johnson, 2022-08-23)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/15
Closes #35 -- then we just have to actually package and upload to PyPI.
Note: would benefit from #39 and #40 being resolved first and then rebasing.

!14 Resolve "Create Documentation" (Appledora, 2022-08-22)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/14
Have started to create a basic README-based documentation structure. The `Example Usage` syntaxes would change slightly after we deploy to the PyPI repo.
Closes #39

!13 feature: extract image audio and video media (Appledora, 2022-08-23)
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/13
Closes #27