Resolve "Word-tokenization evaluation methodology and datasets"
Current evaluation approach:
- use link anchors as potential ground-truth tokens
- tokenization is considered correct if a generated token is a substring of the corresponding anchor
- tokens that neither overlap an anchor nor cross its boundaries are ignored (see the classification sketch below)
- calculate precision, recall, and F1 scores for the algorithm
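A minimal sketch of the classification step described above, assuming anchors and generated tokens are both given as half-open `(start, end)` character offsets into the same text; the function and variable names are illustrative, not taken from the repository:

```python
def classify_tokens(tokens, anchors):
    """Label each token span as 'inside' an anchor (a substring of it),
    'crossing' an anchor boundary, or 'outside' all anchors (ignored)."""
    labels = []
    for t_start, t_end in tokens:
        label = "outside"
        for a_start, a_end in anchors:
            overlaps = t_start < a_end and t_end > a_start
            if not overlaps:
                continue
            if a_start <= t_start and t_end <= a_end:
                label = "inside"    # token lies fully within the anchor
            else:
                label = "crossing"  # token straddles an anchor boundary
            break
        labels.append(label)
    return labels
```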
Precision and recall are calculated as follows:
precision = 1 - (number_of_boundary_crossing_tokens / number_of_boundaries)
recall = 1 - ((tokens_required_to_cover_anchors - number_of_anchors) / (number_of_anchor_chars - number_of_anchors))
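A minimal sketch of the metric computation using the formulas above; the inputs are raw counts, and how they are derived from the anchors (e.g. how many boundaries each anchor contributes) is an assumption here, not fixed by the formulas:

```python
def evaluate(boundary_crossing_tokens, number_of_boundaries,
             tokens_required_to_cover_anchors, number_of_anchors,
             number_of_anchor_chars):
    """Compute precision, recall, and F1 from the anchor-based counts."""
    precision = 1 - boundary_crossing_tokens / number_of_boundaries
    recall = 1 - ((tokens_required_to_cover_anchors - number_of_anchors)
                  / (number_of_anchor_chars - number_of_anchors))
    # F1 is the harmonic mean of precision and recall.
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Under this definition of recall, covering each anchor with a single token gives a score of 1, while splitting every anchor into individual characters gives a score of 0.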
Closes #21