Packaging: add sentence segmentation tests
This will help with unit testing and also give us a simple way to play with the code locally without having to set up a special environment.
A few edge cases I can think of right now for sentence_tokenization_naive (we could also test pre_processing and the global split pattern separately, but I think it's okay to just "test" those as part of the fuller tokenization function):
- empty:
''
- just whitespace:
' '
- sentence with no text:
' . '
- sentence w/o end punctuation:
'This is a sentence'
- well-formed sentences:
'This is a sentence. And another.'
- sentences w/ abbreviation:
'This is Q.E.D. a sentence. And another.'
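
A rough sketch of how these cases could be wired into a test, in case it helps as a starting point. The `sentence_tokenization_naive` below is only a hypothetical stand-in so the snippet runs on its own; its signature and return type (a list of strings) are assumptions, and the real function should be imported from the package instead. The checks are deliberately limited to invariants, since the expected output for each case (especially the abbreviation one) still needs to be decided.

```python
import re


# Hypothetical stand-in, NOT the project's implementation: it only exists
# so that this sketch is runnable. Replace it with an import of the real
# sentence_tokenization_naive.
def sentence_tokenization_naive(text):
    # Split after '.', '!' or '?' when followed by whitespace.
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop pieces that contain no word characters (e.g. a lone '.').
    return [p for p in pieces if re.search(r"\w", p)]


# The edge cases listed in this issue.
EDGE_CASES = [
    "",
    " ",
    " . ",
    "This is a sentence",
    "This is a sentence. And another.",
    "This is Q.E.D. a sentence. And another.",
]


def check_invariants(tokenize, text):
    """Assert properties that should hold whatever the exact split is."""
    sentences = tokenize(text)
    # Assumed contract: a list of non-blank strings.
    assert isinstance(sentences, list)
    assert all(isinstance(s, str) and s.strip() for s in sentences)
    return sentences


for case in EDGE_CASES:
    check_invariants(sentence_tokenization_naive, case)
```

With pytest, the same list could be passed to `pytest.mark.parametrize` so that each input shows up as its own test case, and exact expected outputs can be added per case once we agree on them.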