repos / research / copyedit-common-misspellings / Issues / #5
Issue created Feb 23, 2023 by Isaac Johnson (@isaacj, Maintainer)

Add fields to potential misspellings with number of definitions with and without misspelling template

Goal: two new columns:

  • # definitions
  • # definitions with misspelling template

Uses logic below:

  • Split page into language sections
  • For each section, count the lines that start with # (mwparserfromhell parses the marker as a tag whose tag name is li), then count how many of those lines include a misspelling template.
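
A minimal sketch of the two counts, written with the standard library's `re` module rather than mwparserfromhell so that it runs on its own. The function name `count_definitions`, the rule that `##`, `#:` and `#*` lines are not definitions, and the rule that a misspelling template is any template whose name starts with "misspelling" are all assumptions for illustration, not the project's actual code.

```python
import re

# Assumption: an L2 (language) heading has exactly two equals signs, e.g. "==English==".
L2_HEADING = re.compile(r"^==([^=].*?)==\s*$")
# Assumption: a definition line starts with a single "#"; "##", "#:" and "#*"
# lines (sub-senses, examples, quotations) are not counted.
DEFINITION_LINE = re.compile(r"^#(?![#:*])")
# Assumption: misspelling templates have names starting with "misspelling",
# e.g. {{misspelling of|en|receive}}.
MISSPELLING_TEMPLATE = re.compile(r"\{\{\s*misspelling[^|}]*\|", re.IGNORECASE)


def count_definitions(wikitext):
    """Return {language: {"definitions": n, "with_misspelling": m}} for a page."""
    counts = {}
    language = None
    for line in wikitext.splitlines():
        heading = L2_HEADING.match(line)
        if heading:
            # A new language section starts here.
            language = heading.group(1).strip()
            counts[language] = {"definitions": 0, "with_misspelling": 0}
        elif language is not None and DEFINITION_LINE.match(line):
            counts[language]["definitions"] += 1
            if MISSPELLING_TEMPLATE.search(line):
                counts[language]["with_misspelling"] += 1
    return counts
```

The two values per language map directly onto the two proposed columns. A version built on mwparserfromhell would need to tell `#` markers apart from `*` markers, since both are reported with the li tag name.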

Copying from Issue 1:

Great example. I'm realizing how complicated Wiktionary can be: for all the structure that the community applies, it is still describing language, which is messy, in unstructured wikitext, which is also messy. I'm wondering what basic assumptions we can make that would help us filter our lists. A few thoughts:
* It feels safe to assume that the L2 headings are always languages -- e.g., English, etc. I don't think this is particularly useful right now though because you're already extracting language-codes from the templates, which is far more direct.
* I don't think we can assume anything about which sections are parts-of-speech and which are other pieces of information, like etymology, that are not particularly relevant to our goals. We could always generate an allow-list of parts-of-speech to check -- e.g., if section_title in ['noun', 'verb', ..., ] -- and that would probably work fine for English based on your initial work, but it would likely prevent us from scaling easily to other languages. The same goes for the head templates -- e.g., `{{en-adj}}` and `{{head|en|misspelling}}` -- where I assume any checks would also be difficult to scale to other languages.
* The use of the `#` character to start lines that are definitions of the word feels potentially consistent and language-agnostic. So if there is only one `#` line and it contains a misspelling template, that feels like a confident way of stating that this word is only ever a misspelling.

Based on that, my feeling is that if we want to be able to say "this word X only ever appears as a misspelling in language Y", a reasonably high-precision / high-recall approach might be:
* Split a page into L2 sections (where we assume each is a separate language and therefore can be treated independently)
* For each L2 section, loop through all the lines that begin with `#`. If all of these lines have misspelling templates, then record the word as a misspelling for that language; else assume there are legitimate usages of the spelling and skip.
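
The two steps above could be sketched roughly as follows. This is an illustration only, using plain regular expressions instead of a wikitext parser; the helper names `split_l2_sections` and `misspelling_only_languages` are hypothetical, as is the choice to ignore `##`, `#:` and `#*` lines and to match any template whose name starts with "misspelling".

```python
import re

# Assumption: L2 headings use exactly two equals signs, e.g. "==English==".
L2_HEADING = re.compile(r"^==([^=].*?)==\s*$")
# Assumption: only lines with a single leading "#" are definitions.
DEFINITION_LINE = re.compile(r"^#(?![#:*])")
# Assumption: template names starting with "misspelling" mark misspellings.
MISSPELLING_TEMPLATE = re.compile(r"\{\{\s*misspelling[^|}]*\|", re.IGNORECASE)


def split_l2_sections(wikitext):
    """Map each L2 heading (assumed to be a language) to the lines under it."""
    sections = {}
    current = None
    for line in wikitext.splitlines():
        heading = L2_HEADING.match(line)
        if heading:
            current = heading.group(1).strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def misspelling_only_languages(wikitext):
    """List languages whose definition lines all carry a misspelling template."""
    languages = []
    for language, lines in split_l2_sections(wikitext).items():
        definitions = [ln for ln in lines if DEFINITION_LINE.match(ln)]
        # Require at least one definition so an empty section is not
        # recorded through a vacuously true all().
        if definitions and all(MISSPELLING_TEMPLATE.search(ln) for ln in definitions):
            languages.append(language)
    return languages
```

Requiring at least one definition line is a design choice in this sketch: a section with no `#` lines gives no evidence either way, so it is skipped rather than recorded.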

From a code standpoint, you could also enforce that, if there are multiple misspelling templates under an L2 section, they all have the same language parameter (otherwise discard that language-spelling pair). The main refactoring of the current merge request would be to explicitly process the L2 sections separately and count up / process the `#` lines before recording misspelling templates.
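
One possible shape for the language-parameter check, as a sketch only. It assumes the language code is the first positional parameter of the template (as in `{{misspelling of|en|receive}}`) and that the lines of one L2 section have already been collected; the function name `consistent_language_code` is hypothetical.

```python
import re

# Assumption: the language code is the first positional parameter of any
# template whose name starts with "misspelling". Named parameters such as
# "lang=en" are deliberately not matched by this pattern.
MISSPELLING_LANG = re.compile(
    r"\{\{\s*misspelling[^|}]*\|\s*([^|}=]+?)\s*[|}]", re.IGNORECASE
)


def consistent_language_code(section_lines):
    """Return the one language code shared by all misspelling templates.

    Returns None when the section has no misspelling templates or when the
    templates disagree, in which case the language-spelling pair is discarded.
    """
    codes = set()
    for line in section_lines:
        codes.update(MISSPELLING_LANG.findall(line))
    return next(iter(codes)) if len(codes) == 1 else None
```

Collecting the codes into a set keeps the check independent of how many templates a section contains: one distinct code means agreement, anything else means the pair is dropped.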