Don't fail to iterate dumps
When running scripts/detect_html_tables.py in the section-topics repo, we hit an error caused by mwparserfromhtml failing to handle dump output in which an article has no title (seen in an arwiki article).
mwparserfromhtml should handle such incomplete data more gracefully (e.g. by skipping the article) instead of aborting iteration over the whole dump.
Stack trace from one such run:
Traceback (most recent call last):
  File "/srv/home/mlitn/section-topics/scripts/detect_html_tables.py", line 146, in <module>
    main()
  File "/srv/home/mlitn/section-topics/scripts/detect_html_tables.py", line 139, in main
    df = spark.createDataFrame(dataset)
  File "/home/mlitn/.conda/envs/section_topics_for_dev/lib/python3.10/site-packages/pyspark/sql/session.py", line 675, in createDataFrame
    return self._create_dataframe(data, schema, samplingRatio, verifySchema)
  File "/home/mlitn/.conda/envs/section_topics_for_dev/lib/python3.10/site-packages/pyspark/sql/session.py", line 700, in _create_dataframe
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/home/mlitn/.conda/envs/section_topics_for_dev/lib/python3.10/site-packages/pyspark/sql/session.py", line 509, in _createFromLocal
    data = list(data)
  File "/srv/home/mlitn/section-topics/scripts/detect_html_tables.py", line 41, in generate_dataset
    for article in html_dump:
  File "/home/mlitn/.conda/envs/section_topics_for_dev/lib/python3.10/site-packages/mwparserfromhtml/dump/dump.py", line 71, in read_dump_local
    yield Article(article)
  File "/home/mlitn/.conda/envs/section_topics_for_dev/lib/python3.10/site-packages/mwparserfromhtml/parse/article.py", line 24, in __init__
    self.title = self.parsed_html.title.text
AttributeError: 'NoneType' object has no attribute 'text'
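
One possible shape for the "skip the article" behaviour is sketched below. It is not mwparserfromhtml's actual code: `iter_skipping_malformed` and `StubArticle` are hypothetical names used only for illustration, and `StubArticle` merely imitates the failure in the trace (reading an attribute of a missing title) with the standard library so the sketch runs on its own. The idea is to catch the error around object construction inside the generator, log it, and move on to the next record.

```python
import logging
import re
from typing import Callable, Iterable, Iterator, Tuple, Type, TypeVar

logger = logging.getLogger(__name__)

Raw = TypeVar("Raw")
Parsed = TypeVar("Parsed")


def iter_skipping_malformed(
    records: Iterable[Raw],
    build: Callable[[Raw], Parsed],
    errors: Tuple[Type[BaseException], ...] = (AttributeError,),
) -> Iterator[Parsed]:
    """Yield build(record) for each record, skipping those that raise `errors`.

    Hypothetical helper: a real fix would wrap the existing
    `yield Article(article)` call inside the dump reader in a similar way.
    """
    for index, record in enumerate(records):
        try:
            parsed = build(record)
        except errors as exc:
            # Log and continue instead of letting one bad record end the run.
            logger.warning("Skipping malformed record %d: %s", index, exc)
            continue
        yield parsed


class StubArticle:
    """Stand-in for the library's Article class (assumption, not the real one)."""

    def __init__(self, html: str) -> None:
        match = re.search(r"<title>(.*?)</title>", html, re.S)
        # Mirrors the bug: when there is no <title>, `match` is None and the
        # attribute access raises AttributeError.
        self.title = match.group(1)


if __name__ == "__main__":
    sample = [
        "<html><head><title>First</title></head></html>",
        "<html><body>no title here</body></html>",
        "<html><head><title>Second</title></head></html>",
    ]
    for article in iter_skipping_malformed(sample, StubArticle):
        print(article.title)
```

The guard has to live inside the generator that builds the articles. Wrapping the `for article in html_dump:` loop in the calling script in a try/except would not help, because a Python generator is finished once an exception propagates out of it, so the remaining articles could not be read.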