Tokenization: Restructure tokenizer code into a class
Context: [@isaacj] - I think let's just file this next suggestion as an issue to discuss rather than implementing it now, but another option is to create a class for the tokenizers that you instantiate with a language; it loads the relevant abbreviations at initialization and sets the correct tokenizers -- e.g., whitespace-delimited or not. We might end up having to do that later anyway, and I think in the long run it will keep things simpler (e.g., store the language code once rather than passing it every time). For example:
import json

class Tokenizer:
    def __init__(self, language=None, abbreviation_fn='dict_abbr_filtered.json'):
        # WIKI_LANGUAGES and NON_WHITESPACE_LANGUAGES are assumed module-level constants
        if language in WIKI_LANGUAGES:
            self.language = language
        else:
            self.language = 'en'  # not sure this is the right default but maybe?
        try:
            with open(abbreviation_fn, 'r') as fin:
                self.abbreviations = json.load(fin)[self.language]
            self.post_process = True
        except Exception:
            self.abbreviations = {}
            self.post_process = False
        if self.language in NON_WHITESPACE_LANGUAGES:
            self.sentence_tokenization = ...  # function for non-whitespace-delimited languages
        else:
            self.sentence_tokenization = ...  # function for whitespace-delimited languages
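Purely as an illustration of the "store the language code once" point, instantiation could then look like the snippet below (hypothetical; it assumes WIKI_LANGUAGES contains 'fr' and that the placeholder tokenization functions above get filled in):

tok = Tokenizer(language='fr')
print(tok.language)       # 'fr', stored once at construction time
print(tok.post_process)   # False if dict_abbr_filtered.json could not be read
sentence_fn = tok.sentence_tokenization  # whichever scheme was selected for 'fr'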
[Nazia] - This is also how we have implemented our word tokenizer (see MR#8). So in the next iteration, we can try to reconcile both in a similar fashion.
Approach (a rough sketch follows this list):
- Have a base tokenizer class that initializes shared attributes (language, preprocessing decisions, abbreviation file location, etc.)
- The class exposes methods for both word and sentence tokenization
- Funnel each call to the appropriate tokenization scheme for the language (e.g., whitespace-delimited or not)
- Each method yields an iterator of string segments (tokens or sentences)
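A minimal sketch of what that reconciled class could look like, assuming illustrative contents for the WIKI_LANGUAGES / NON_WHITESPACE_LANGUAGES constants and simple stand-in segmentation rules (the real schemes from MR#8 and the sentence tokenizer would replace them):

import json
import re
from typing import Iterator

# Illustrative placeholders for the constants assumed in the snippet above
WIKI_LANGUAGES = {'en', 'fr', 'ja', 'zh'}
NON_WHITESPACE_LANGUAGES = {'ja', 'zh'}


class BaseTokenizer:
    """Sketch of a shared tokenizer; names and schemes are placeholders."""

    def __init__(self, language=None, abbreviation_fn='dict_abbr_filtered.json'):
        self.language = language if language in WIKI_LANGUAGES else 'en'
        try:
            with open(abbreviation_fn, 'r') as fin:
                self.abbreviations = json.load(fin).get(self.language, {})
            self.post_process = True
        except (OSError, json.JSONDecodeError):
            self.abbreviations = {}
            self.post_process = False

    def words(self, text: str) -> Iterator[str]:
        # Funnel to a language-appropriate word scheme
        if self.language in NON_WHITESPACE_LANGUAGES:
            # stand-in: character-level segmentation for non-whitespace-delimited languages
            yield from (ch for ch in text if not ch.isspace())
        else:
            yield from text.split()

    def sentences(self, text: str) -> Iterator[str]:
        # Funnel to a language-appropriate sentence scheme
        terminators = r'[。！？]' if self.language in NON_WHITESPACE_LANGUAGES else r'[.!?]'
        for segment in re.split(f'(?<={terminators})\\s*', text):
            if segment.strip():
                yield segment.strip()

Usage would then be, e.g., list(BaseTokenizer('en').sentences('Hello there. How are you?')), and language-specific subclasses could override just the scheme they need while the constructor stays shared.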