
Resolve "Word-tokenization evaluation methodology and datasets"

Appledora requested to merge 21-word-tokenization-eval into main

Current evaluation approach:

- use link anchors as potential ground-truth tokens
- a tokenization is considered correct if the generated tokens are substrings of the corresponding anchors
- ignore all tokens that neither overlap with the anchors nor cross their boundaries
- calculate precision, recall, and F1 scores for the algorithm
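A minimal sketch of this matching logic, assuming anchors and generated tokens are available as `(start, end)` character offsets; `classify_tokens` and all other names here are illustrative, not taken from the repository:

```python
def classify_tokens(token_spans, anchor_spans):
    """Split tokens into correct, boundary-crossing, and ignored groups."""
    correct, crossing, ignored = [], [], []
    for t_start, t_end in token_spans:
        overlaps = False
        for a_start, a_end in anchor_spans:
            if t_end <= a_start or t_start >= a_end:
                continue  # no overlap with this anchor
            overlaps = True
            if a_start <= t_start and t_end <= a_end:
                # token is a substring of the anchor -> counted as correct
                correct.append((t_start, t_end))
            else:
                # token overlaps the anchor but crosses its boundary
                crossing.append((t_start, t_end))
            break
        if not overlaps:
            # tokens that neither overlap nor cross an anchor are ignored
            ignored.append((t_start, t_end))
    return correct, crossing, ignored
```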

Precision and recall are calculated as follows:

precision = 1 - (tokens_crossing_anchor_boundaries / number_of_boundaries)
recall = 1 - ((tokens_required_to_cover_anchors - number_of_anchors) / (number_of_anchor_chars - number_of_anchors))
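A sketch of the metric computation from these formulas, assuming the counts have already been collected (e.g. via the `classify_tokens` sketch above); argument names are illustrative, not taken from the repository:

```python
def precision_recall_f1(tokens_crossing_boundaries, number_of_boundaries,
                        tokens_required_to_cover_anchors,
                        number_of_anchors, number_of_anchor_chars):
    # precision penalizes tokens that cross anchor boundaries
    precision = 1 - tokens_crossing_boundaries / number_of_boundaries
    # recall penalizes needing many tokens to cover the anchor characters
    recall = 1 - ((tokens_required_to_cover_anchors - number_of_anchors)
                  / (number_of_anchor_chars - number_of_anchors))
    # standard harmonic mean for F1
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 2 anchors (4 boundaries), 1 boundary-crossing token,
# 5 tokens needed to cover 12 anchor characters.
print(precision_recall_f1(1, 4, 5, 2, 12))  # -> (0.75, 0.7, ~0.724)
```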

Closes #21

