Add tokenization in misspelling finder
Currently we tokenize sentences on whitespace (`text.split()`) to divide the text into words and find misspellings. This will not work for languages that are not space-delimited (Japanese, Chinese, etc.). We can use research/wiki-nlp-tools to perform the tokenization instead.
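A minimal sketch of what that switch could look like. Everything in this snippet (the `mwtokenizer` import path, the `Tokenizer`/`word_tokenize` names, the `tokenize_words` wrapper, the `lang` parameter) is an assumption to check against the wiki-nlp-tools repo, not the confirmed API:

```python
# Sketch only: assumes research/wiki-nlp-tools can be used roughly like this;
# the package, class, and method names below are assumptions, not the confirmed API.
from mwtokenizer.tokenizer import Tokenizer  # assumed import path

_tokenizers = {}  # cache one tokenizer per wiki language

def tokenize_words(text, lang):
    """Language-aware replacement for text.split()."""
    if lang not in _tokenizers:
        _tokenizers[lang] = Tokenizer(language_code=lang)  # assumed constructor argument
    # word_tokenize is assumed to yield word tokens; drop whitespace-only tokens
    return [tok for tok in _tokenizers[lang].word_tokenize(text) if tok.strip()]
```

The helper name and the caching are just for illustration; the main point is that every place that currently calls `text.split()` would go through a per-language tokenizer instead.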
Possible places in the code where changes need to be made (for the code as of Issue #7 (closed)):
- tokenization to separate words, in `get_data` (see the first sketch after this list): `for word in text.split(): ...`
- splitting and joining again to get quotes and to detect the language (see the normalization sketch after this list):
  `section_text = ' '.join(section_text_raw.split())`
  `paragraphs = [" ".join(para.split()) for para in paragraphs]`
- In `get_len` (same pattern; see the normalization sketch after this list): `text = ' '.join(text.split())`
- regex for getting quotes
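For the `get_data` loop, the change could be as small as swapping the split call for the wrapper (sketch; `tokenize_words` is the hypothetical helper above, and it assumes the wiki's language code is available as `lang` at that point in `get_data`):

```python
# Before (space-delimited only):
#     for word in text.split(): ...
# After (sketch; tokenize_words and lang are assumptions):
for word in tokenize_words(text, lang):
    ...  # existing per-word misspelling check stays the same
```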
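The `' '.join(text.split())` normalization used for quote extraction, language detection, and in `get_len` also bakes in the space-delimited assumption: re-joining on spaces inserts separators that non-space-delimited text never had. One possible shape for a shared helper, as a sketch (`lang_uses_spaces()` is a hypothetical per-language lookup, not something wiki-nlp-tools is known to provide):

```python
def normalize_whitespace(text, lang):
    """Replacement for ' '.join(text.split()).

    Space-delimited languages keep the current behaviour; for other
    languages, whitespace runs are collapsed without forcing a space
    between tokens. lang_uses_spaces() is a hypothetical lookup table.
    """
    if lang_uses_spaces(lang):
        return ' '.join(text.split())
    return ''.join(text.split())
```

Whether dropping all whitespace is right for mixed-script text (e.g. Latin words embedded in Japanese prose) would need checking against real articles, so this is a starting point rather than a final rule.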