Add fields to potential misspellings with number of definitions with and without misspelling template
Goal: two new columns:
- # definitions
- # definitions with misspelling template
Uses logic below:
- Split page into language sections
- For each section, count the lines that start with a `#` tag (mwparserfromhell identifies it as a tag with the `li` tag name) and count how many of those lines also include a misspelling template.
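A minimal sketch of the counting logic above, for illustration only. It uses plain regular expressions from the standard library as a stand-in for mwparserfromhell, and the helper names (`split_l2_sections`, `count_definitions`) are hypothetical. It also assumes that L2 headings look like `==English==` and that misspelling templates are written as `{{misspelling of|...}}`.

```python
import re

# Assumption: an L2 heading is a line of the form "==Language==".
L2_HEADING = re.compile(r"^==([^=].*?)==[ \t]*$", re.MULTILINE)
# Assumption: misspelling templates are named "misspelling of".
MISSPELLING = re.compile(r"\{\{\s*misspelling of\s*\|")


def split_l2_sections(wikitext):
    """Return {L2 heading: section text} for each L2 section on the page."""
    # re.split with one capture group yields
    # [preamble, heading1, body1, heading2, body2, ...].
    parts = L2_HEADING.split(wikitext)
    return {parts[i].strip(): parts[i + 1] for i in range(1, len(parts) - 1, 2)}


def count_definitions(section_text):
    """Return (# lines starting with '#', # of those with a misspelling template)."""
    total = with_template = 0
    for line in section_text.splitlines():
        if line.startswith("#"):
            total += 1
            if MISSPELLING.search(line):
                with_template += 1
    return total, with_template
```

Note that this sketch counts every line beginning with `#`, so sub-items such as `#:` or `#*` are not treated specially; whether they should be excluded is left open here.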
Copying from Issue 1:
Great example. I'm realizing how complicated Wiktionary can be: for all the structure that the community applies, it is still describing language, which is messy, using unstructured wikitext, which is also messy. I'm wondering what basic assumptions we can make that would help us filter our lists. A few thoughts:
* It feels safe to assume that the L2 headings are always languages -- e.g., English, etc. I don't think this is particularly useful right now though because you're already extracting language-codes from the templates, which is far more direct.
* I don't think we can assume anything about which sections are parts-of-speech and which are other pieces of information, like etymology, that are not particularly relevant to our goals. We could always generate an allow-list of parts-of-speech to check -- e.g., `if section_title in ['noun', 'verb', ..., ]` -- and that would probably work fine for English based on your initial work, but it would likely prevent us from scaling easily to other languages. The same goes for the head templates -- e.g., `{{en-adj}}` and `{{head|en|misspelling}}` -- where I assume any checks would also be difficult to scale to other languages.
* The use of the `#` character to start lines that are definitions of the word feels potentially consistent and language-agnostic. So if there is only one `#` line and it's a misspelling template, that feels like a confident way of stating that this word is only ever a misspelling.
Based on that, my feeling is that if we want to be able to say: "this word X only ever appears as misspelling in language Y", a reasonably high-precision / high-recall approach might be:
* Split a page into L2 sections (where we assume each is a separate language and therefore can be treated independently)
* For each L2 section, loop through all the lines that begin with `#`. If all of these lines have misspelling templates, then record as misspelling for that language; else assume there are legitimate usages of the spelling and skip.
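The two steps above could be sketched roughly as follows. This is an illustrative sketch rather than the merge request's code: `misspelling_only_languages` is a hypothetical helper, the input is assumed to be a dict mapping each L2 heading to its section wikitext, and the template name `misspelling of` is an assumption.

```python
import re

# Assumption: misspelling templates look like {{misspelling of|<lang>|<word>}}.
MISSPELLING = re.compile(r"\{\{\s*misspelling of\s*\|")


def misspelling_only_languages(sections):
    """Return the L2 headings whose '#' lines all carry a misspelling template.

    `sections` maps each L2 heading (assumed to be a language) to its wikitext.
    """
    result = []
    for language, text in sections.items():
        definitions = [line for line in text.splitlines() if line.startswith("#")]
        # all() is True for an empty list, so require at least one definition.
        if definitions and all(MISSPELLING.search(line) for line in definitions):
            result.append(language)
    return result
```

Sections with no `#` lines at all are skipped rather than recorded, since there is nothing to confirm that the spelling is only a misspelling.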
From a code standpoint, you could also enforce that, if there are multiple misspelling templates under an L2 section, they all have the same language parameter (otherwise discard that language-spelling pair). The main refactoring of the current merge request would be to explicitly process the L2 sections separately and count up / process the `#` lines before recording misspelling templates.
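One possible way to express the language-parameter check, again as a hedged sketch: `section_language_code` is a hypothetical helper, and it assumes the language code is the first positional parameter of a `{{misspelling of|...}}` template.

```python
import re

# Assumption: the language code is the first positional parameter,
# e.g. {{misspelling of|en|receive}} -> "en".
MISSPELLING_LANG = re.compile(r"\{\{\s*misspelling of\s*\|\s*([^|}]+?)\s*[|}]")


def section_language_code(section_text):
    """Return the one language code shared by all misspelling templates in an
    L2 section, or None if there are no templates or the codes disagree."""
    codes = set(MISSPELLING_LANG.findall(section_text))
    return codes.pop() if len(codes) == 1 else None
```

A `None` result would correspond to discarding that language-spelling pair.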