Tokenization: Restructure tokenizer code into a class
Context: [@isaacj] - I think let's just file this next suggestion as an issue to discuss rather than implementing it now, but another option is to create a class for the tokenizers that you instantiate with a language; it loads the relevant abbreviations at initialization and sets the correct tokenizers -- e.g., whitespace-delimited or not. We might end up having to do that later anyway, and I think in the long run it will keep things simpler (e.g., store the language code once rather than passing it every time). For example:
import json

class Tokenizer:
    def __init__(self, language=None, abbreviation_fn='dict_abbr_filtered.json'):
        # WIKI_LANGUAGES and NON_WHITESPACE_LANGUAGES are assumed module-level constants
        if language in WIKI_LANGUAGES:
            self.language = language
        else:
            self.language = 'en'  # not sure this is the right default but maybe?
        try:
            with open(abbreviation_fn, 'r') as fin:
                self.abbreviations = json.load(fin)[self.language]
            self.post_process = True
        except Exception:
            self.abbreviations = {}
            self.post_process = False
        if self.language in NON_WHITESPACE_LANGUAGES:
            self.sentence_tokenization = ...  # function for non-whitespace-delimited languages
        else:
            self.sentence_tokenization = ...  # function for whitespace-delimited languages
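Purely as an illustration of the "store the language code once" point, instantiation could then look like the snippet below (hypothetical; it assumes WIKI_LANGUAGES contains 'fr' and that the placeholder tokenization functions above get filled in):

tok = Tokenizer(language='fr')
print(tok.language)       # 'fr', stored once at construction time
print(tok.post_process)   # False if dict_abbr_filtered.json could not be read
sentence_fn = tok.sentence_tokenization  # whichever scheme was selected for 'fr'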
[Nazia] - This is also how we have implemented our word tokenizer (see MR#8). So in the next iteration, we can try to reconcile both in a similar fashion.
Approach (a rough sketch follows this list):
- Have a base tokenizer class that initializes shared attributes (language, preprocessing decisions, abbreviation file location, etc.)
- The class exposes methods for both word and sentence tokenization
- Funnel each call to the appropriate tokenization scheme for the language (e.g., whitespace-delimited or not)
- Each method yields an iterator of string segments (tokens or sentences)
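A minimal sketch of what that reconciled class could look like, assuming illustrative contents for the WIKI_LANGUAGES / NON_WHITESPACE_LANGUAGES constants and simple stand-in segmentation rules (the real schemes from MR#8 and the sentence tokenizer would replace them):

import json
import re
from typing import Iterator

# Illustrative placeholders for the constants assumed in the snippet above
WIKI_LANGUAGES = {'en', 'fr', 'ja', 'zh'}
NON_WHITESPACE_LANGUAGES = {'ja', 'zh'}


class BaseTokenizer:
    """Sketch of a shared tokenizer; names and schemes are placeholders."""

    def __init__(self, language=None, abbreviation_fn='dict_abbr_filtered.json'):
        self.language = language if language in WIKI_LANGUAGES else 'en'
        try:
            with open(abbreviation_fn, 'r') as fin:
                self.abbreviations = json.load(fin).get(self.language, {})
            self.post_process = True
        except (OSError, json.JSONDecodeError):
            self.abbreviations = {}
            self.post_process = False

    def words(self, text: str) -> Iterator[str]:
        # Funnel to a language-appropriate word scheme
        if self.language in NON_WHITESPACE_LANGUAGES:
            # stand-in: character-level segmentation for non-whitespace-delimited languages
            yield from (ch for ch in text if not ch.isspace())
        else:
            yield from text.split()

    def sentences(self, text: str) -> Iterator[str]:
        # Funnel to a language-appropriate sentence scheme
        terminators = r'[。！？]' if self.language in NON_WHITESPACE_LANGUAGES else r'[.!?]'
        for segment in re.split(f'(?<={terminators})\\s*', text):
            if segment.strip():
                yield segment.strip()

Usage would then be, e.g., list(BaseTokenizer('en').sentences('Hello there. How are you?')), and language-specific subclasses could override just the scheme they need while the constructor stays shared.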