Resolve "Evaluation: update and modify the sentence evaluation code"

Appledora requested to merge 27-eval-bench into main

Updated the old sentence benchmarking code to make it compatible with the current tokenizer implementation. Given a JSON file in the following format:

{
"en": [sentence 1, sentence 2 ... sentence 100],
"de": [sentence 1, sentence 2 ... sentence 100],
"bn": [sentence 1, sentence 2 ... sentence 100],
....
}

the benchmarking code outputs a CSV file with the following columns:

<correct> <partially correct> <incorrect> <missing> <accuracy>
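
As a rough illustration of the input and output shapes described above, here is a minimal Python sketch. It is not the code in this merge request: the function names `load_sentences` and `write_summary`, the extra `language` column, and the definition of accuracy as `correct / total` are all assumptions made for the example.

```python
import csv
import io
import json

# The five count/score columns come from the description above;
# the leading "language" column is an assumption made for this sketch.
SUMMARY_COLUMNS = [
    "language", "correct", "partially correct",
    "incorrect", "missing", "accuracy",
]


def load_sentences(raw_json):
    """Parse the input JSON and check it maps language codes to lists of strings."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object keyed by language code")
    for lang, sentences in data.items():
        if not isinstance(sentences, list) or not all(
            isinstance(s, str) for s in sentences
        ):
            raise ValueError(f"expected a list of strings for language {lang!r}")
    return data


def write_summary(counts_by_lang, out):
    """Write one CSV row per language from already-computed counts.

    Assumption: accuracy = correct / (sum of the four count columns).
    """
    writer = csv.DictWriter(out, fieldnames=SUMMARY_COLUMNS)
    writer.writeheader()
    for lang, counts in counts_by_lang.items():
        total = sum(counts[c] for c in SUMMARY_COLUMNS[1:5])
        accuracy = counts["correct"] / total if total else 0.0
        writer.writerow({"language": lang, **counts, "accuracy": f"{accuracy:.4f}"})


if __name__ == "__main__":
    sentences = load_sentences('{"en": ["First one.", "Second one."]}')
    buffer = io.StringIO()
    # Hypothetical counts, only to show the row layout.
    write_summary(
        {"en": {"correct": 1, "partially correct": 1, "incorrect": 0, "missing": 0}},
        buffer,
    )
    print(buffer.getvalue())
```

Passing a file-like object to `write_summary` keeps the sketch easy to exercise in memory; a real run would pass an open file instead.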

The code also generates a benchmarking log as a CSV file (see the attached image).

We can identify four types of errors:

  • type 1 (2-no-match): splits into two sentences, but neither matches the input sentences
  • type 2 (>2-one-match): splits into more than two sentences, with at least one of the sentences matching an input sentence
  • type 3 (>2-no-match): splits into more than two sentences, with none of the sentences matching an input sentence
  • type 4 (no-split): doesn't split into two sentences
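
The four error types could be assigned along these lines. This is a sketch, not the code in this merge request: it assumes each test case is a pair of input sentences that were joined and then re-split by the tokenizer, and that a "match" is exact string equality after stripping whitespace. The function name `classify_split` is hypothetical.

```python
def classify_split(predicted, expected):
    """Return the error label for one tokenizer output, or None.

    predicted: list of sentences produced by the tokenizer.
    expected:  the two input sentences (assumed setup for this sketch).
    None is returned for outcomes outside the four listed error types,
    such as a two-sentence split where at least one sentence matches.
    """
    expected_set = {s.strip() for s in expected}
    matches = sum(1 for s in predicted if s.strip() in expected_set)

    if len(predicted) < 2:
        return "no-split"          # type 4
    if len(predicted) == 2:
        # type 1 only when neither output sentence matches an input sentence
        return "2-no-match" if matches == 0 else None
    # more than two output sentences: type 2 or type 3
    return ">2-one-match" if matches >= 1 else ">2-no-match"


if __name__ == "__main__":
    pair = ("It rained.", "We stayed in.")
    print(classify_split(["It rained. We stayed in."], pair))
    print(classify_split(["It rained.", "We stayed", "in."], pair))
```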

Closes #27 (closed)

Edited by Appledora
