html-dumps merge requests
https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests
2022-06-21T13:18:31Z

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/1
Feature/structure
2022-06-21T13:18:31Z
Appledora
- [x] Created the initial structure of the library, which looks like the following:
```bash
.
├── docs
├── README.md
├── src
│   ├── dump
│   │   ├── dump.py
│   │   ├── __init__.py
│   ├── __init__.py
│   ├── parse
│   │   ├── data.py
│   │   ├── __init__.py
│   │   └── utils.py
│   └── temp_test.ipynb
└── tests
    ├── __init__.py
    └── test_dump.py
```
- [x] Created separate class files for `HTML Dumps` and `Articles`
- [x] Included the current `requirements.txt`
- [x] Implemented methods:
  - `get_html()`
  - `get_comments()`
  - `get_headers()`
  - `get_sections()`
Appledora

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/3
Create initial structure for tests with headings as example.
2022-06-23T18:11:53Z
Isaac Johnson
Create initial structure for tests with headings as example. Run via `pytest` from the top level of the repo.
Closes #4

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/2
Feature/wikilink : issue 5
2022-06-27T12:44:11Z
Appledora
- Created a Base Element class
- Extended it for WikiLinks
- Categorized Wikilinks into subclasses
Solves issue #5
Appledora

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/5
feature: extract external links
2022-07-01T16:35:47Z
Appledora
Closes #13

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/4
feature: extract categories and normalize category links
2022-07-05T15:15:47Z
Appledora
Closes #7

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/6
add static namespace list and utility for generating it to help with namespace...
2022-07-05T19:39:15Z
Isaac Johnson
Add static namespace list and utility for generating it to help with namespace detection for wikilinks
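A minimal sketch of how a static namespace list could support namespace detection for wikilink titles. The mapping below is a small hand-written stand-in for the generated list, and `detect_namespace` is a hypothetical helper name, not the library's API.

```python
# Hypothetical stand-in for the generated static namespace list.
# The IDs follow the standard MediaWiki numbering; a real list would be
# generated per wiki and include localized names and aliases.
NAMESPACE_IDS = {"talk": 1, "user": 2, "file": 6, "template": 10, "category": 14}


def detect_namespace(title):
    """Return the namespace ID implied by a title's prefix, or 0 (main)."""
    prefix, sep, _rest = title.partition(":")
    if not sep:
        return 0  # no colon, so there is no namespace prefix
    key = prefix.replace("_", " ").strip().casefold()
    # A colon inside an ordinary title is not a namespace, so unknown
    # prefixes fall back to the main namespace.
    return NAMESPACE_IDS.get(key, 0)


for example in ["Category:Physics", "Albert Einstein", "Star Wars: A New Hope"]:
    print(example, detect_namespace(example))
```

Keeping the list static avoids an API call per link, at the cost of having to regenerate it when namespaces change.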
Closes #6

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/7
feature: template extraction method
2022-07-11T18:05:28Z
Appledora
Closes #14
Appledora

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/8
write test for existing extraction method
2022-07-14T20:45:52Z
Appledora
This MR is essentially a patch collection for several issues, from #17 to #25 (except for #22).
Tests were implemented using pytest and verified by running `pytest -v`.
Closes #21
Appledora

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/9
Resolve "add function to extract references to library"
2022-07-15T14:16:53Z
Appledora
References are identified by looking for `<span>` tags with the attribute `{"class": "mw-reference-text"}`. We also store the `id` of each reference, which can help track the position where the reference was used.
- [x] testing functions
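The lookup described above can be sketched with the standard library's `html.parser`. This is an illustration only: `ReferenceCollector`, `extract_references`, and the sample HTML are assumptions made for the example, and the library itself may use a different parser and return type.

```python
from html.parser import HTMLParser


class ReferenceCollector(HTMLParser):
    """Collect the id and text of <span class="mw-reference-text"> tags."""

    def __init__(self):
        super().__init__()
        self.references = []
        self._current = None  # reference being read, if any
        self._depth = 0       # nesting depth of <span> inside that reference

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is not None:
            if tag == "span":
                self._depth += 1  # nested span inside the reference text
            return
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "mw-reference-text" in classes:
            self._current = {"id": attrs.get("id"), "text": ""}
            self._depth = 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == "span":
            self._depth -= 1
            if self._depth == 0:  # the reference span itself has closed
                self.references.append(self._current)
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def extract_references(html):
    collector = ReferenceCollector()
    collector.feed(html)
    return collector.references


# Hand-written sample; real article HTML is much larger.
SAMPLE = (
    '<ol><li><span id="mw-reference-text-cite_note-1" class="mw-reference-text">'
    "Smith, <i>A Book</i>, 2001.</span></li></ol>"
    '<span class="other">not a reference</span>'
)
print(extract_references(SAMPLE))
```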
Closes #28
Appledora

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/11
Resolve "add functions to extract plaintexts to library"
2022-08-18T08:00:16Z
Appledora
See this notebook for the outputs: https://public.paws.wmcloud.org/User:Isaac_(WMF)/Outreachy%20Summer%202022/plaintext_exp5.ipynb
**No test written for this yet**
Closes #32
Appledora

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/12
feature: added namespace attribute to Wikilink instances, language attribute...
2022-08-23T15:27:17Z
Appledora
feature: added namespace attribute to Wikilink instances, language attribute to Article class, link attribute to Category, Wikilink, ExternalLink and Template
Closes #22
Appledora

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/14
Resolve "Create Documentation"
2022-08-22T15:17:41Z
Appledora
Have started to create a basic README-based documentation structure. The `Example Usage` syntax will change slightly after we deploy to PyPI.
Closes #39

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/15
Initial template for packaging
2022-08-23T19:04:52Z
Isaac Johnson
Closes #35 -- then just have to actually package and upload to PyPI
Note: would benefit from #39 and #40 being resolved first and then rebasing.

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/16
feature: metadata extraction
2022-08-23T18:57:49Z
Appledora
Extracts all the metadata from the dump JSON, except for the `["article_body", "url", "namespace", "name", "in_language"]` keys.
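As a rough sketch of this filtering, assuming each dump entry has already been parsed into a `dict`: the excluded keys come from the description above, while `extract_metadata` and the sample record are hypothetical.

```python
# Keys named in the description above as excluded from the metadata.
EXCLUDED_KEYS = {"article_body", "url", "namespace", "name", "in_language"}


def extract_metadata(record):
    """Return every top-level field of a dump record except the excluded keys."""
    return {key: value for key, value in record.items() if key not in EXCLUDED_KEYS}


# Hypothetical record; real dump entries carry many more fields.
record = {
    "name": "Example",
    "url": "https://en.wikipedia.org/wiki/Example",
    "identifier": 42,
    "date_modified": "2022-08-01T00:00:00Z",
    "article_body": {"html": "<p>Example</p>"},
}
print(extract_metadata(record))
```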
Closes #41
Appledora

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/13
feature: extract image audio and video media
2022-08-23T19:01:11Z
Appledora
Closes #27
Appledora

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/17
Fix flake8, simplify to just Python3, clean-up contribution guide
2022-12-19T20:32:38Z
Isaac Johnson
Closes #51 and makes a few other small tweaks to the contribution guide / pre-commit

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/19
Expose real page title from Article
2023-06-20T17:16:28Z
Matthias Mullie
Fixes #52

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/18
Don't fail to iterate dumps
2023-06-22T10:51:01Z
Matthias Mullie
This simply skips articles that could not be initialized
(e.g. malformed data) rather than causing complete failure that
consumers can't recover from.
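The skip-on-failure behaviour can be illustrated as follows. This is a sketch under assumptions: `iter_articles` and `make_article` are hypothetical names, and the dump is modelled as an iterable of JSON lines rather than the library's actual reader.

```python
import json
import logging


def iter_articles(lines, make_article):
    """Yield one article per line, skipping lines that cannot be initialized."""
    for line_number, line in enumerate(lines, start=1):
        try:
            yield make_article(json.loads(line))
        except (ValueError, KeyError, TypeError) as err:
            # json.JSONDecodeError is a ValueError; missing or mistyped fields
            # raise KeyError or TypeError in the constructor below.
            logging.warning("Skipping line %d: %s", line_number, err)


def make_article(data):
    # Hypothetical constructor: requires a name and an HTML body.
    return (data["name"], data["article_body"]["html"])


lines = [
    '{"name": "A", "article_body": {"html": "<p>a</p>"}}',
    "{not valid json",
    '{"name": "missing body"}',
    '{"name": "B", "article_body": {"html": "<p>b</p>"}}',
]
articles = list(iter_articles(lines, make_article))
print(articles)
```

Logging each skipped line keeps the failure visible to consumers without stopping the iteration.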
Fixes #119

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/20
Full rehaul of HTML dumps looking towards 1.0.0 release
2024-01-04T15:18:15Z
Isaac Johnson
Full overhaul of HTML dumps to standardize, add more flexibility for use outside of HTML dumps, and make plaintext generation more extensible. More detailed list of changes:
Miscellaneous/Packaging:
* Update documentation
* Separate code for generating constants as utils outside of core installed library
* Structure tests so that running them requires installing an editable local version of the code, to align with best practices. Contribution guide updated to reflect this, and all imports switched from relative to absolute
* Remove unused constants
Dumps:
* No longer require additional metadata from Enterprise dump to be present to process an article's HTML. (#48)
* Dump no longer has a helper parameter for stopping after x articles. Stopping early is easy to do in a for loop, and the parameter complicated the code.
* Create new Document object for Dumps that holds an article's HTML plus the additional metadata found in the HTML dumps
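For the removed "stop after x articles" parameter, a caller can limit iteration itself. The sketch below uses a stand-in generator (`fake_dump`, hypothetical) in place of a real dump object, since any iterable works the same way.

```python
from itertools import islice


def fake_dump(total):
    """Stand-in for iterating a dump: yields placeholder article names."""
    for index in range(total):
        yield f"article-{index}"


# Option 1: a plain for loop with an explicit counter.
first_three = []
for count, article in enumerate(fake_dump(10), start=1):
    first_three.append(article)
    if count == 3:
        break

# Option 2: itertools.islice stops the underlying iterator after n items.
also_first_three = list(islice(fake_dump(10), 3))

print(first_three, also_first_three)
```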
Functionality:
* Transclusion check now also checks parents to see if they were transcluded which more accurately reflects transclusion status of an item. For example, now a wikilink within an infobox should also be marked as transcluded.
* Split up media into separate image/audio/video functions for simplicity and consistency
* Move is_<nodetype> functions to be associated directly with their class objects and optimize them to run faster
* Remove some unnecessary parameters in Elements. For example, we weren't doing anything special when calculating plaintext, so it doesn't need its own attribute. Not all elements have title/tid, so these are removed from the base class. Transclusion is now best calculated via the is_transcluded util, but this is semi-expensive so I don't do it by default.
* Simplify string representations for better readability / debugging
* Add new Table (wikitable/infobox/sidebar/message box - #29, #53, #117, #116) object, Hatnote (#118), Section, Citation, and Comment classes.
* Align `get_<element>` functions to all return a list of the Element class objects for consistency rather than some returning Elements, some returning strings, etc.
* I removed the Template class and just left in a get_templates function. While things like Category, Reference, etc. make sense in the HTML because there are HTML tags that correspond, the Template doesn't really exist in the HTML, it's just a property of another tag (that it was transcluded). So I adjusted it so you could extract them but they weren't treated the same as the other HTML elements.
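The parent-aware transclusion check mentioned in the list above can be sketched on a toy tree. Everything here is an assumption made for illustration: `Node` is a stand-in for a parsed HTML element, `has_transcluded_ancestor` is a hypothetical name, and the sketch assumes transcluded content is marked with `mw:Transclusion` in the `typeof` attribute.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Toy stand-in for a parsed HTML element with a parent pointer."""
    name: str
    attrs: dict = field(default_factory=dict)
    parent: Optional["Node"] = None


def has_transcluded_ancestor(node):
    """True if the node or any of its ancestors is marked as transcluded."""
    current = node
    while current is not None:
        if "mw:Transclusion" in current.attrs.get("typeof", ""):
            return True
        current = current.parent  # walk up; this is the semi-expensive part
    return False


infobox = Node("table", {"typeof": "mw:Transclusion", "class": "infobox"})
link_in_infobox = Node("a", {"rel": "mw:WikiLink"}, parent=infobox)
link_in_text = Node("a", {"rel": "mw:WikiLink"}, parent=Node("p"))

print(has_transcluded_ancestor(link_in_infobox), has_transcluded_ancestor(link_in_text))
```

The walk costs time proportional to the node's depth, which is consistent with the note above that the check is not run by default.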
Plaintext:
* Move plaintext functions to a separate file (to avoid circular import issues).
* Add basic plaintext tests
* Plaintext now emits not just text but also transclusion status and the parent node types for that text so the user can decide which types to retain and which to skip -- e.g., skip categories + references + templated content.
* Created helper function for getting the first paragraph that is opinionated about what to exclude. This is mainly to show how one might adjust the more general plaintext function, but it might also be useful itself.
Isaac Johnson

https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/21
Adjust elements and plaintext
2024-01-24T22:40:24Z
Isaac Johnson
* Add Math element
* Change Table/Hatnote to more specific Infobox, Wikitable, Navigation, Messagebox, Note
* Update plaintext to not track transclusion (poor proxy)
* Update plaintext to instead track paragraph relation and the new elements (better proxy)
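To illustrate how a caller might use element types to decide which text to keep, here is a sketch under stated assumptions. The plaintext stream is modelled as `(text, element_types)` tuples, and `filter_plaintext` and `first_paragraph` are hypothetical helpers; the library's actual output format and its `get_first_paragraph` may differ.

```python
def filter_plaintext(chunks, exclude):
    """Join text chunks, dropping any chunk tagged with an excluded element type."""
    return "".join(text for text, types in chunks if not set(types) & exclude)


def first_paragraph(chunks, exclude):
    """Return the first non-empty paragraph, with paragraphs separated by newlines."""
    for paragraph in filter_plaintext(chunks, exclude).split("\n"):
        if paragraph.strip():
            return paragraph.strip()
    return ""


# Hypothetical stream: each chunk carries the element types it sits inside.
chunks = [
    ("For other uses, see Foo (disambiguation).\n", ["Note"]),
    ("Foo is a placeholder name", []),
    ("[1]", ["Reference"]),
    (" used in examples.\n", []),
    ("It is often paired with bar.\n", []),
]
print(first_paragraph(chunks, exclude={"Note", "Reference"}))
```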
Addresses #121 (reorg of classes), #116 (now correctly extracting navboxes), #43 (ignore transclusion -- use other, better indicators), #42 (split by sections, and paragraphs can be easily split by `\n` as shown in `get_first_paragraph`)
Isaac Johnson