Sentence Tokenization: use existing multilingual translation or other sentence-level datasets as an evaluation dataset
The sentence dataset we currently use for the segmentation evaluation task doesn't contain a good distribution of abbreviations or other edge cases, so we cannot measure post-processing performance on them. Sentence-level datasets built for translation and other NLP tasks could be processed into a better evaluation dataset: since they already come split into gold sentences, concatenating them yields raw text with known boundary positions.
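A minimal sketch of that processing step, assuming the source dataset is available as a plain list of gold sentences (the function names and the boundary-F1 metric here are illustrative, not an existing API):

```python
def build_eval_example(sentences):
    """Join gold sentences into one text; record character offsets of boundaries."""
    text = " ".join(sentences)
    boundaries = []
    pos = 0
    for sent in sentences[:-1]:
        pos += len(sent)
        boundaries.append(pos)  # boundary sits right after the sentence
        pos += 1                # account for the joining space
    return text, boundaries


def boundary_f1(predicted, gold):
    """F1 over exact boundary offsets, a common segmentation metric."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: abbreviation-heavy sentences expose naive splitters.
sentences = ["Dr. Smith arrived.", "He left at 5 p.m."]
text, gold = build_eval_example(sentences)
# text == "Dr. Smith arrived. He left at 5 p.m.", gold == [18]
```

A naive rule like "split after every `. `" would predict boundaries after "Dr." and inside "p.m.", which this gold set would correctly penalize.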
Criteria: Manual annotation