Add tokenization in misspelling finder
Currently we tokenize sentences on whitespace (`text.split()`) to divide the text into words and find misspellings. This will not work for languages that are not space-delimited (Japanese, Chinese, etc.). We can use research/wiki-nlp-tools to perform the tokenization instead.
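A minimal sketch of what that switch could look like. Everything in this snippet (the `mwtokenizer` import path, the `Tokenizer`/`word_tokenize` names, the `tokenize_words` wrapper, the `lang` parameter) is an assumption to check against the wiki-nlp-tools repo, not the confirmed API:

```python
# Sketch only: assumes research/wiki-nlp-tools can be used roughly like this;
# the package, class, and method names below are assumptions, not the confirmed API.
from mwtokenizer.tokenizer import Tokenizer  # assumed import path

_tokenizers = {}  # cache one tokenizer per wiki language

def tokenize_words(text, lang):
    """Language-aware replacement for text.split()."""
    if lang not in _tokenizers:
        _tokenizers[lang] = Tokenizer(language_code=lang)  # assumed constructor argument
    # word_tokenize is assumed to yield word tokens; drop whitespace-only tokens
    return [tok for tok in _tokenizers[lang].word_tokenize(text) if tok.strip()]
```

The helper name and the caching are just for illustration; the main point is that every place that currently calls `text.split()` would go through a per-language tokenizer instead.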
Possible places in the code where changes need to be made (for the code as of Issue #7 (closed)):
- tokenization to separate words, in `get_data` (see the first sketch after this list): `for word in text.split(): ...`
- splitting and joining again to get quotes and to detect the language (see the normalization sketch after this list):
  `section_text = ' '.join(section_text_raw.split())`
  `paragraphs = [" ".join(para.split()) for para in paragraphs]`
- In `get_len` (same pattern; see the normalization sketch after this list): `text = ' '.join(text.split())`
- regex for getting quotes
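For the `get_data` loop, the change could be as small as swapping the split call for the wrapper (sketch; `tokenize_words` is the hypothetical helper above, and it assumes the wiki's language code is available as `lang` at that point in `get_data`):

```python
# Before (space-delimited only):
#     for word in text.split(): ...
# After (sketch; tokenize_words and lang are assumptions):
for word in tokenize_words(text, lang):
    ...  # existing per-word misspelling check stays the same
```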
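The `' '.join(text.split())` normalization used for quote extraction, language detection, and in `get_len` also bakes in the space-delimited assumption: re-joining on spaces inserts separators that non-space-delimited text never had. One possible shape for a shared helper, as a sketch (`lang_uses_spaces()` is a hypothetical per-language lookup, not something wiki-nlp-tools is known to provide):

```python
def normalize_whitespace(text, lang):
    """Replacement for ' '.join(text.split()).

    Space-delimited languages keep the current behaviour; for other
    languages, whitespace runs are collapsed without forcing a space
    between tokens. lang_uses_spaces() is a hypothetical lookup table.
    """
    if lang_uses_spaces(lang):
        return ' '.join(text.split())
    return ''.join(text.split())
```

Whether dropping all whitespace is right for mixed-script text (e.g. Latin words embedded in Japanese prose) would need checking against real articles, so this is a starting point rather than a final rule.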