Fix mismatches between docstrings and code
The docstrings are largely out of date. We should make sure that every function has one and that each matches the current behavior. A good place to start is the WikiStew class (src/mwparserfromhtml/parse/article.py), as I suspect that is the primary interface for most users. For each get_... function there, we'll want to add some details about what it does. For example, get_images currently reads:
def get_images(self) -> typing.List[Media]:
    """
    extract images from a BeautifulSoup object.
    Returns:
        typing.List[Media]: list of image media objects
    """
We'd want to clean up any grammar/spelling mistakes, explain what an image actually is, and give any suggestions around usage. We'll follow Google's Python style guide ([docs](https://google.github.io/styleguide/pyguide.html#383-functions-and-methods)). So perhaps it would say something like:
def get_images(self) -> typing.List[Media]:
    """Extract images from a BeautifulSoup object.

    Usage notes:
        Many "images" on Wikipedia are actually tiny icons. To filter these out,
        it's easiest to set some minimum pixel area for the image. For example,
        only including images where:
            image.width * image.height > 10000  # 100x100 pixels

    Learn more:
        * https://commons.wikimedia.org/wiki/Commons:File_types#Images
        * https://www.mediawiki.org/wiki/Specs/HTML/2.8.0#Images

    Returns:
        typing.List[Media]: list of image media objects
    """
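To make the usage note concrete, here is a rough sketch of the minimum-area filter it describes. It only assumes that each image object exposes numeric width and height attributes, as the example in the docstring implies. FakeImage and filter_small_images are hypothetical names made up for illustration; they are not part of the library, and real Media objects may behave differently (for instance, if a dimension is missing).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FakeImage:
    """Hypothetical stand-in for a Media object; only width/height are assumed."""
    title: str
    width: int
    height: int


def filter_small_images(images: Iterable, min_area: int = 10000) -> List:
    """Keep images whose pixel area is strictly greater than min_area.

    The default of 10000 corresponds to 100x100 pixels, as in the usage note.
    """
    return [img for img in images if img.width * img.height > min_area]


# Illustrative data only; in practice the list would come from get_images().
images = [
    FakeImage("icon.svg", 20, 20),
    FakeImage("photo.jpg", 800, 600),
    FakeImage("exactly_100.png", 100, 100),
]
large = filter_small_images(images)
print([img.title for img in large])  # ['photo.jpg']
```

Note that the comparison is strict, so an image of exactly 100x100 pixels is dropped. If that is not what we want, the docstring should probably say >= instead.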