Resolve "Word-tokenization evaluation methodology and datasets"
Current evaluation approach:
- use link anchors as potential ground-truth tokens
- tokenization is considered correct if a generated token is a substring of the corresponding anchor
- tokens that neither overlap an anchor nor cross its boundaries are ignored (see the classification sketch below)
- calculate precision, recall, and F1 scores for the algorithm
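A minimal sketch of the classification step described above, assuming anchors and generated tokens are both given as half-open `(start, end)` character offsets into the same text; the function and variable names are illustrative, not taken from the repository:

```python
def classify_tokens(tokens, anchors):
    """Label each token span as 'inside' an anchor (a substring of it),
    'crossing' an anchor boundary, or 'outside' all anchors (ignored)."""
    labels = []
    for t_start, t_end in tokens:
        label = "outside"
        for a_start, a_end in anchors:
            overlaps = t_start < a_end and t_end > a_start
            if not overlaps:
                continue
            if a_start <= t_start and t_end <= a_end:
                label = "inside"    # token lies fully within the anchor
            else:
                label = "crossing"  # token straddles an anchor boundary
            break
        labels.append(label)
    return labels
```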
Precision and recall are calculated as follows:
precision = 1 - (number_of_boundary_crossing_tokens / number_of_boundaries)
recall = 1 - ((tokens_required_to_cover_anchors - number_of_anchors) / (number_of_anchor_chars - number_of_anchors))
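A minimal sketch of the metric computation using the formulas above; the inputs are raw counts, and how they are derived from the anchors (e.g. how many boundaries each anchor contributes) is an assumption here, not fixed by the formulas:

```python
def evaluate(boundary_crossing_tokens, number_of_boundaries,
             tokens_required_to_cover_anchors, number_of_anchors,
             number_of_anchor_chars):
    """Compute precision, recall, and F1 from the anchor-based counts."""
    precision = 1 - boundary_crossing_tokens / number_of_boundaries
    recall = 1 - ((tokens_required_to_cover_anchors - number_of_anchors)
                  / (number_of_anchor_chars - number_of_anchors))
    # F1 is the harmonic mean of precision and recall.
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Under this definition of recall, covering each anchor with a single token gives a score of 1, while splitting every anchor into individual characters gives a score of 0.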
Closes #21