Extract infoboxes from articles
Infoboxes can generally be detected by the presence of a class with the word infobox in it. This is a norm rather than a rule, and it is enforced only through template code. While it is not guaranteed to hold, it seems to hold well for most wikis, and there is no more straightforward way of identifying infoboxes. The one hazard is templates that are not infoboxes but inherited infobox-related code and never removed the class. Amir Aharoni provided the following examples for Hindi (e.g., Template:Cookbook, which adds the Vegetable Wikibook link towards the end of this article):
- BS-map # Railway map
- Cookbook
- Small Egyptian Dynasty List # English, and untranslated
- Wikisource author
- WildlifeofIndia
- WVS # Wikiversity
- कामन्सश्रेणी # Commonscat
- ग्रैंड ट्रंक रोड # Grand Trunk Road
- नरेन्द्र मोदी # Narendra Modi
- बन्धु प्रकल्प # Sisterlinks
- बन्धु प्रक्प # Sister project links
- बौद्ध धार्मिक स्थल # Buddhist holy site
- भारत का संविधान # Constitution of India
- भारत के राष्ट्रीय प्रतीक # National symbols of India
- भारत में इस्लाम # Islam in India
- भारतीय थल सेना के समकक्ष रैंक # Ranks of Indian military
- भारतीय फिल्म सूची # Indian film list
- रेडियो वर्णक्रम # Radio spectrum
- विकिस्रोतनाम # Wikisource Britannica 1911
- सन्दूक हिन्दू धर्म # Hinduism
- सिख धर्म सन्दूक # Sikhs
Some of these might be filtered out by also recording whether the infobox appears in the lede and exposing that as an easy filter: many of them are navigation link boxes, which tend to appear at the end of the article.
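A minimal sketch of both ideas (class-based detection plus a lede flag), using only the Python standard library. Several things here are assumptions for illustration and not part of the original notes: the names InfoboxFinder and find_infoboxes, the input being rendered and well-formed article HTML, and the use of "before the first <h2>" as a stand-in for "in the lede". Only the outermost matching element is recorded, so child elements with classes such as infobox-data are not counted as separate infoboxes.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class InfoboxFinder(HTMLParser):
    """Record outermost elements whose class attribute contains 'infobox'."""

    def __init__(self):
        super().__init__()
        self.seen_heading = False  # True once the first <h2> has been passed
        self.depth = 0             # > 0 while inside an already-recorded infobox
        self.infoboxes = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            # Nested element inside a recorded infobox: track depth only.
            self.depth += 1
            return
        if tag == "h2":
            # Assumption: everything before the first <h2> is the lede.
            self.seen_heading = True
            return
        classes = dict(attrs).get("class") or ""
        if "infobox" in classes.lower():
            self.infoboxes.append({
                "tag": tag,
                "class": classes,
                "in_lede": not self.seen_heading,
            })
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1


def find_infoboxes(html):
    """Return one dict per detected infobox, in document order."""
    finder = InfoboxFinder()
    finder.feed(html)
    finder.close()
    return finder.infoboxes
```

With this in place, something like `[box for box in find_infoboxes(html) if box["in_lede"]]` would drop boxes that appear after the first section heading. It would not catch a misclassified template that happens to sit in the lede, so the flag is a filter aid rather than a guarantee.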