Filter abbreviations list for each language
CONTEXT:
Nazia: My general observation is that the abbreviations list is itself quite faulty. For example, it contains entries like "it.". In our current implementation, if a sentence ends in "it." and that token is marked as an abbreviation, we concatenate the sentence with the next segment. The logic: if (last word of the 1st sentence + 1st word of the 2nd sentence) matches an abbreviation, the text was likely split inside an abbreviation, so we rejoin the wrongly split segments. To get better coverage for sentences like "Have Moly and Co. made it to the program?", we also look up the last word of the 1st sentence and the 1st word of the 2nd sentence individually in the abbreviations list. This individual check ensures that we rejoin the splits ["Have Moly and Co.", "made it to the program?"], since "Co." exists in the abbreviations list. However, it also causes problems with passages like "Just because you've always done it this way, it doesn't mean that it's the best way to do it. Tom claims he has telepathic powers." Here the splitter DOES initially split on "it.", but post-processing finds "it." in the abbreviations list and concatenates the sentence with the subsequent one.
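The merge rule described above can be sketched roughly as follows. This is only an illustration of the logic as stated in the comment, not the actual segmentation code; the function name `should_merge` and the sample abbreviation set are hypothetical.

```python
def should_merge(first, second, abbreviations):
    """Decide whether two adjacent segments should be rejoined.

    Mirrors the rule described above: merge if the last word of the first
    segment plus the first word of the second forms an abbreviation, or if
    either word is itself in the abbreviations list.
    """
    first_words = first.split()
    second_words = second.split()
    if not first_words or not second_words:
        return False
    last_word = first_words[-1]
    next_word = second_words[0]
    return (
        (last_word + next_word) in abbreviations  # split inside an abbreviation
        or last_word in abbreviations             # e.g. "Co."
        or next_word in abbreviations
    )


# Hypothetical abbreviation list containing the faulty entry "it."
abbreviations = {"Co.", "it.", "e.g."}

# Desired merge: the split happened right after "Co."
wanted = should_merge("Have Moly and Co.", "made it to the program?", abbreviations)

# Undesired merge: "it." at a real sentence end also matches the list
unwanted = should_merge(
    "it doesn't mean that it's the best way to do it.",
    "Tom claims he has telepathic powers.",
    abbreviations,
)
```

With "it." in the list, both calls return `True`, which is the over-merging problem described above; removing "it." from the set makes the second call return `False` while the first is unaffected.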
Isaac: Maybe a function that takes in an article text, splits on whitespace, and counts how many times a word like "it" appears by itself vs. with the full stop? Doing this across lots of documents would hopefully show that the ratio of "Co." to "Co" is quite high while the ratio of "it." to "it" is quite low. We could use those differences to filter down the abbreviation list.
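A minimal sketch of such a counting function, assuming candidates are given with their trailing full stop (e.g. "Co.") and that matching is case-sensitive. The name `count_with_and_without_stop` is hypothetical. Because it splits on whitespace only, tokens such as "it," or "(Co" are not counted, so a real run would likely need more careful tokenization.

```python
from collections import Counter


def count_with_and_without_stop(text, candidates):
    """Count candidate abbreviations with and without their trailing full stop.

    `candidates` is a set of strings ending in "." (for example {"Co.", "it."}).
    Returns two Counters keyed by the candidate: occurrences with the full
    stop, and occurrences of the bare form without it.
    """
    with_stop = Counter()
    without_stop = Counter()
    # Map each bare form ("Co") back to its candidate ("Co.")
    bare_to_candidate = {c.rstrip("."): c for c in candidates}
    for token in text.split():
        if token in candidates:
            with_stop[token] += 1
        elif token in bare_to_candidate:
            without_stop[bare_to_candidate[token]] += 1
    return with_stop, without_stop


text = "Moly and Co. made it to the program. Tom did it. Then it rained."
with_stop, without_stop = count_with_and_without_stop(text, {"Co.", "it."})
```

Summing these counters over many documents gives the per-project totals needed to compare how often each entry appears with and without the full stop.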
Goals:
- Create separate abbreviation files for each Wiktionary, with filtration metadata
- Create a threshold-filtered abbreviation list for each wiki project
- Upload related code and notebooks
- Update the sentence segmentation code to consider language-specific abbreviation lists
- Have a fallback to the global list
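One way the language-specific lookup with a global fallback could look, as a sketch only. The function name `get_abbreviations`, the dictionary layout, and the sample entries are assumptions for illustration, not the existing segmentation code or real filtered lists.

```python
def get_abbreviations(lang, per_language, global_list):
    """Return the filtered list for `lang`, or the global list if none exists.

    `per_language` maps a language code to its filtered abbreviation set.
    A missing or empty entry falls back to `global_list`.
    """
    specific = per_language.get(lang)
    return specific if specific else global_list


# Illustrative data only
global_list = {"Co.", "it.", "Dr."}
per_language = {"en": {"Co.", "Dr."}}

english = get_abbreviations("en", per_language, global_list)
other = get_abbreviations("xx", per_language, global_list)
```

Here `english` is the filtered set without "it.", while the unknown code "xx" falls back to the unfiltered global list.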
Process:
- Have a global list of abbreviations from the Wiktionaries
- For each wiki project, count the number of occurrences of each abbreviation (with its ending punctuation)
- For each of these abbreviations, count the number of times it appears without the ending punctuation, across the whole wiki project
- Calculate the ratio r = (occurrences with ending punctuation) / (occurrences with + occurrences without ending punctuation)
- Generate metadata on whether a word should be considered an abbreviation at different threshold levels, for each wiki project
- Set a specific filter level and extract the abbreviations for each wiki project
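The ratio and threshold steps above can be sketched as follows. The function names, the threshold values, and the counts in the example are all hypothetical and chosen only to illustrate the calculation; they are not measured results from any wiki project.

```python
def punctuation_ratio(with_stop, without_stop):
    """r = occurrences with ending punctuation / (with + without)."""
    total = with_stop + without_stop
    return with_stop / total if total else 0.0


def build_metadata(counts, thresholds=(0.25, 0.5, 0.75)):
    """Build per-abbreviation metadata for several threshold levels.

    `counts` maps an abbreviation to a (with_stop, without_stop) tuple.
    The default thresholds are placeholders, not recommended values.
    """
    metadata = {}
    for abbr, (with_stop, without_stop) in counts.items():
        r = punctuation_ratio(with_stop, without_stop)
        metadata[abbr] = {
            "ratio": r,
            # Whether the entry would be kept at each threshold level
            "keep_at": {t: r >= t for t in thresholds},
        }
    return metadata


def filter_abbreviations(metadata, threshold):
    """Extract the abbreviations whose ratio meets the chosen threshold."""
    return sorted(a for a, m in metadata.items() if m["ratio"] >= threshold)


# Made-up counts for one wiki project, for illustration only
counts = {"Co.": (90, 10), "it.": (5, 95)}
metadata = build_metadata(counts)
filtered = filter_abbreviations(metadata, threshold=0.5)
```

With these made-up counts, "Co." has a high ratio and survives the 0.5 filter, while "it." has a low ratio and is dropped. Choosing the actual filter level would depend on inspecting the metadata for each wiki project.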