Available Tokenizer Analysis
Before reviewing the older tokenizer implementations used for Wikipedia, we should first look at how pre-established NLP packages such as NLTK, Gensim and spaCy handle tokenization. For each package, we compare the following (a small usage sketch follows the list):
- Number of languages supported
- Available regex/pattern/punctuation lists
- Internal tokenizer implementation
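
As a quick point of reference, the sketch below runs the same sentence through the default word-level tokenizers of the three packages. This is a minimal example only, assuming the packages are installed and that NLTK's punkt tokenizer data can be downloaded; the sample sentence and printed labels are illustrative, not part of any of the libraries.

```python
# Minimal comparison of the surface-level tokenizer APIs of NLTK, Gensim and spaCy.
import nltk
from nltk.tokenize import word_tokenize
from gensim.utils import simple_preprocess
import spacy

text = "Wikipedia's article titles aren't always easy to tokenize, e.g. U.S.A."

# NLTK: Treebank-style, rule/regex-based word tokenizer (English-centric).
# Requires the punkt data (punkt_tab on newer NLTK releases).
nltk.download("punkt", quiet=True)
print("NLTK  :", word_tokenize(text))

# Gensim: simple_preprocess lowercases, strips punctuation and very short tokens.
print("Gensim:", simple_preprocess(text))

# spaCy: a blank pipeline exposes only the rule-based tokenizer
# (prefix/suffix/infix regexes plus per-language exception tables).
nlp = spacy.blank("en")
print("spaCy :", [tok.text for tok in nlp(text)])
```

Even on this one sentence, the outputs differ in how contractions, punctuation and abbreviations are split, which is exactly the behaviour the criteria above are meant to capture.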