Use smoothed BLEU to compare at the sentence level.
The standard BLEU score used to evaluate machine translation (BLEU:4) is only really meaningful at the corpus level, since any sentence that does not have at least one 4-gram match will get a score of 0.
This is because BLEU is really just a geometric mean of n-gram precisions, scaled by a brevity penalty to prevent very short sentences with some matching material from being given inappropriately high scores. Since a geometric mean is computed by multiplying together all of the terms it includes, a zero for any of the n-gram counts makes the whole score zero.
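For reference, the standard definition (Papineni et al., 2002) is

$$\text{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big), \qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

where $p_n$ is the clipped n-gram precision, $w_n = 1/N$ (with $N = 4$ for BLEU:4), $c$ is the candidate length and $r$ the reference length. If any $p_n = 0$, the exponent is $-\infty$ and the whole score collapses to 0.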
If you want to apply BLEU to individual sentences, you are better off using smoothed BLEU (Lin and Och 2004, see sec. 4), whereby you add 1 to each n-gram count before calculating the n-gram precisions. This prevents any of the n-gram precisions from being zero, and so gives non-zero scores even when there are no 4-gram matches.
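To make the difference concrete, here is a minimal, self-contained sketch of sentence-level BLEU-4 with and without the add-1 smoothing described above. The class and method names are just for illustration; this is not the Phrasal implementation.

```java
import java.util.*;

// Sketch: sentence-level BLEU-4 with and without add-1 smoothing of n-gram counts.
public class SentenceBleu {

    // Collect n-gram counts of order n for a token sequence.
    static Map<String, Integer> ngramCounts(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            counts.merge(String.join(" ", tokens.subList(i, i + n)), 1, Integer::sum);
        }
        return counts;
    }

    // BLEU over 1..maxN grams; if smooth is true, add 1 to the matched and total
    // n-gram counts before computing each precision (as described above).
    static double bleu(List<String> candidate, List<String> reference,
                       int maxN, boolean smooth) {
        double logPrecisionSum = 0.0;
        for (int n = 1; n <= maxN; n++) {
            Map<String, Integer> candCounts = ngramCounts(candidate, n);
            Map<String, Integer> refCounts = ngramCounts(reference, n);
            int matches = 0, total = 0;
            for (Map.Entry<String, Integer> e : candCounts.entrySet()) {
                matches += Math.min(e.getValue(), refCounts.getOrDefault(e.getKey(), 0));
                total += e.getValue();
            }
            if (smooth) { matches += 1; total += 1; }   // add-1 smoothing
            if (matches == 0 || total == 0) return 0.0; // an unsmoothed zero precision kills the score
            logPrecisionSum += Math.log((double) matches / total);
        }
        // Brevity penalty: penalize candidates shorter than the reference.
        double bp = candidate.size() >= reference.size()
                ? 1.0
                : Math.exp(1.0 - (double) reference.size() / candidate.size());
        return bp * Math.exp(logPrecisionSum / maxN);
    }

    public static void main(String[] args) {
        List<String> cand = Arrays.asList("the cat sat on the rug".split(" "));
        List<String> ref  = Arrays.asList("the cat is on the mat".split(" "));
        // No 3- or 4-gram match, so unsmoothed BLEU-4 is 0 while the smoothed score is not.
        System.out.println("plain    BLEU-4: " + bleu(cand, ref, 4, false));
        System.out.println("smoothed BLEU-4: " + bleu(cand, ref, 4, true));
    }
}
```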
Java implementation
You will find Java implementations of both BLEU and smoothed BLEU in the Stanford machine translation package, Phrasal.
Alternatives
As Andreas already mentioned, you could use an alternative measure such as the Levenshtein string edit distance. However, one problem with using the traditional Levenshtein edit distance to compare sentences is that it has no explicit awareness of word boundaries.
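To illustrate the word-boundary issue, here is a small sketch that runs the same Levenshtein recurrence once over characters and once over word tokens (the word-level variant is exactly the word error rate listed below). The class name is just for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: the same edit-distance recurrence over characters vs. word tokens.
public class EditDistanceDemo {

    // Standard dynamic-programming edit distance over any sequence of items.
    static <T> int levenshtein(List<T> a, List<T> b) {
        int[][] d = new int[a.size() + 1][b.size() + 1];
        for (int i = 0; i <= a.size(); i++) d[i][0] = i;
        for (int j = 0; j <= b.size(); j++) d[0][j] = j;
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                int sub = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                          Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[a.size()][b.size()];
    }

    static List<Character> chars(String s) {
        return s.chars().mapToObj(c -> (char) c).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String hyp = "the cat sat on the rug";
        String ref = "the cat sat on the mat";
        // Character level: three edits (r->m, u->a, g->t), all inside one word.
        System.out.println(levenshtein(chars(hyp), chars(ref)));          // 3
        // Word level: a single word substitution, regardless of word length.
        System.out.println(levenshtein(Arrays.asList(hyp.split(" ")),
                                       Arrays.asList(ref.split(" "))));   // 1
    }
}
```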
Other alternatives include:
- Word error rate (WER). This is essentially the Levenshtein distance applied to a sequence of words rather than a sequence of characters. It is widely used to evaluate speech recognition systems.
- Translation Edit Rate (TER). This is similar to the word error rate, but it additionally allows a swap edit operation for adjacent words and phrases. This metric has become popular in the machine translation community because it correlates better with human judgments than other sentence-similarity measures such as BLEU. The most recent variant, known as TER-Plus (TERp), allows matching of WordNet synonyms as well as paraphrase matching of multiword sequences ("died" ~= "kicked the bucket").
- METEOR. This metric first computes an alignment that allows arbitrary reordering of the words in the two sentences being compared. If there are multiple possible ways to align the sentences, METEOR selects the one that minimizes the number of crossing alignment edges. Like TERp, METEOR allows matching of WordNet synonyms and paraphrase matching of multiword sequences. After alignment, the metric computes the similarity between the two sentences by using the number of matching words to calculate an F-α score, a balanced measure of precision and recall, which is then scaled by a penalty for the amount of word-order scrambling present in the alignment (a sketch of this scoring step follows the list).
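For concreteness, here is a rough sketch of METEOR's scoring step using the constants from the original Banerjee and Lavie (2005) formulation (later versions make the weights tunable). It assumes the alignment step has already produced the number of matched unigrams and contiguous match chunks; the names are just for illustration.

```java
// Sketch of METEOR's final scoring step, given an already-computed alignment.
public class MeteorScore {

    static double meteor(int matches, int hypLen, int refLen, int chunks) {
        if (matches == 0) return 0.0;
        double precision = (double) matches / hypLen;   // matched unigrams / hypothesis length
        double recall    = (double) matches / refLen;   // matched unigrams / reference length
        // Recall-weighted harmonic mean of precision and recall.
        double fMean = 10.0 * precision * recall / (recall + 9.0 * precision);
        // Fragmentation penalty: more, shorter chunks => more word-order scrambling.
        double penalty = 0.5 * Math.pow((double) chunks / matches, 3.0);
        return fMean * (1.0 - penalty);
    }

    public static void main(String[] args) {
        // e.g. 5 of 6 hypothesis words matched against a 6-word reference,
        // grouped into 2 contiguous chunks in the alignment.
        System.out.println(meteor(5, 6, 6, 2));
    }
}
```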
– dmcer