Word Tokenization: treat numbers as punctuation
In the current implementation of word tokenization, numerals are treated as alphabetic letters and split using the same logic. However, numerals are often used to enumerate text segments, e.g. `1. Here is a text 2) this is another text`. Ideally, in these cases we want the numeric tokens to be `1.` and `2)`, instead of splitting off the exterior punctuation as separate tokens. If numerals were treated as punctuation, these cases would automatically fall under our existing logic of clumping punctuation together.
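
A minimal sketch of the idea, assuming a simplified tokenizer that groups consecutive characters of the same class. The names `char_class` and `tokenize` are hypothetical and do not reflect the actual implementation:

```python
from itertools import groupby


def char_class(ch):
    """Classify a character; digits deliberately share a class with punctuation."""
    if ch.isspace():
        return "space"
    if ch.isalpha():
        return "word"
    # Assumption: everything else (punctuation AND numerals) is clumped together.
    return "punct"


def tokenize(text):
    """Split text into runs of same-class characters, dropping whitespace."""
    return [
        "".join(group)
        for cls, group in groupby(text, key=char_class)
        if cls != "space"
    ]


print(tokenize("1. Here is a text 2) this is another text"))
# ['1.', 'Here', 'is', 'a', 'text', '2)', 'this', 'is', 'another', 'text']
```

One trade-off of this sketch: a mixed token such as `covid19` would be split into `covid` and `19`, which may or may not be desirable.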
To-do:
- This has to be language-agnostic, so we would need a list of numerals for all languages.
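
One possible alternative to maintaining a hand-written list is to rely on Unicode general categories, which mark numeric characters across scripts. The sketch below assumes Python's standard `unicodedata` module; `is_numeral` is a hypothetical helper name:

```python
import unicodedata


def is_numeral(ch):
    """Return True if the character's Unicode general category is numeric.

    Categories starting with "N" are Nd (decimal digit), Nl (letter number,
    e.g. Roman numeral characters) and No (other number, e.g. fractions).
    """
    return unicodedata.category(ch).startswith("N")


for ch in ["3", "٣", "३", "Ⅳ", "a", "三"]:
    print(repr(ch), unicodedata.category(ch), is_numeral(ch))
```

This covers Arabic-Indic and Devanagari digits without any per-language list, but it is not complete: CJK ideographic numerals such as `三` have category `Lo` (letter) and are not detected this way, so some scripts may still need explicit handling.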