Calculating a BLEU score to measure the similarity of two sentences

I need to calculate a BLEU score to determine whether two sentences are similar or not. I have read several articles, but they mainly deal with using BLEU to measure the accuracy of machine translation. I need a BLEU score to measure the similarity between sentences in the same language (i.e. both sentences are written in English). Thanks in advance.


Well, if you just want to calculate a BLEU score, it's simple: treat one sentence as the reference translation and the other as the candidate translation.
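
A minimal sketch of that idea in Python, using NLTK's `sentence_bleu` (just one readily available implementation, not something mentioned in the answer; whitespace tokenization is used only to keep the example short):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Treat one sentence as the reference and the other as the candidate.
reference = "the cat is sitting on the mat".split()
candidate = "the cat sits on the mat".split()

# sentence_bleu expects a list of references; smoothing avoids a hard zero
# when there is no 4-gram match (see the next answer).
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(score)
```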


Use smoothed BLEU for sentence-level comparison.

The standard BLEU score used to evaluate machine translation (BLEU:4) is only really meaningful at the corpus level, since any sentence that does not have at least one 4-gram match will receive a score of 0.

This is because BLEU is essentially just a geometric mean of n-gram precisions, scaled by a brevity penalty to prevent very short sentences that contain some matching material from receiving inappropriately high scores. Since a geometric mean is calculated by multiplying all of the terms it includes, a zero value for any of the n-gram counts makes the whole score zero.
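
For reference, the standard BLEU definition (Papineni et al., 2002) is

$$\text{BLEU} = \text{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad \text{BP} = \begin{cases} 1 & c > r \\ e^{\,1 - r/c} & c \le r \end{cases}$$

where $p_n$ are the modified n-gram precisions, $w_n$ are the weights (usually uniform, $w_n = 1/N$ with $N = 4$), $c$ is the candidate length and $r$ is the reference length. If any $p_n$ is zero, the whole product collapses to zero.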

If you want to apply BLEU to individual sentences, you are better off using smoothed BLEU (Lin and Och 2004 - see sec. 4), whereby you add 1 to each n-gram count before calculating the n-gram precisions. This keeps every n-gram precision above zero and therefore yields a non-zero score even when there are no 4-gram matches.
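
A minimal self-contained sketch of that idea (add-one smoothing of the n-gram counts, in the spirit of Lin and Och 2004; the function name and whitespace tokenization are just for illustration):

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def smoothed_sentence_bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU with add-one smoothed n-gram counts; inputs are token lists."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped matches: each candidate n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        # Add-one smoothing keeps every precision strictly positive.
        precisions.append((matches + 1) / (total + 1))
    # Geometric mean of the smoothed precisions.
    geo_mean = exp(sum(log(p) for p in precisions) / max_n)
    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else exp(1 - len(reference) / max(len(candidate), 1))
    return bp * geo_mean

print(smoothed_sentence_bleu("the cat sat on the mat".split(),
                             "the cat is on the mat".split()))
```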

Java implementation

You will find Java implementations of both BLEU and smoothed BLEU in the machine translation package Stanford Phrasal.

Alternatives

As Andreas already mentioned, you could use an alternative metric such as the Levenshtein string edit distance. However, one problem with using the traditional Levenshtein string edit distance to compare sentences is that it is not explicitly aware of word boundaries.

Other alternatives include:

  • Word error rate. This is essentially the Levenshtein distance applied to a sequence of words rather than a sequence of characters. It is widely used for evaluating speech recognition systems (see the sketch after this list).
  • Translation Edit Rate (TER). This is similar to the word error rate, but it additionally allows a swap edit operation for adjacent words and phrases. This metric has become popular in the machine translation community because it correlates better with human judgment than other sentence-similarity measures such as BLEU. The most recent variant of this metric, known as Translation Edit Rate Plus (TERp), allows matching of synonyms using WordNet as well as paraphrasing of multiword sequences ("died" ~= "kicked the bucket").
  • METEOR. This metric first computes an alignment that allows arbitrary reordering of the words in the two sentences being compared. If there are several possible ways to align the sentences, METEOR selects the one that minimizes the number of crossing word alignments. Like TERp, METEOR allows matching of WordNet synonyms and paraphrases of multiword sequences. After alignment, the metric computes the similarity between the two sentences by using the number of matching words to calculate an F-α score, a balanced measure of precision and recall, which is then scaled by a penalty for the amount of word-order scrambling present in the alignment.
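
To illustrate the word error rate idea mentioned above (Levenshtein distance over words rather than characters), a minimal sketch; the function name and whitespace tokenization are just for illustration:

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over word sequences, normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] = edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat is on the mat", "the cat sat on a mat"))
```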

Perhaps edit distance (Levenshtein) or Hamming distance would also be an option. In any case, a BLEU score is also suitable for this task; it measures the similarity of one sentence to a reference, so it only makes sense when both sentences are in the same language, as in your problem.
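
A small sketch of those two character-level distances, using NLTK's `edit_distance` (the `hamming` helper is just for illustration; Hamming distance is only defined for equal-length strings):

```python
from nltk import edit_distance  # character-level Levenshtein distance

a = "the cat sat on the mat"
b = "the cat sits on the mat"

# Number of character insertions, deletions, and substitutions.
print(edit_distance(a, b))

def hamming(x, y):
    """Number of differing positions; requires equal-length strings."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(c1 != c2 for c1, c2 in zip(x, y))

print(hamming("kitten", "sitten"))  # 1
```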


You can use the Moses multi-bleu script, which also supports multiple references: https://github.com/moses-smt/mosesdecoder/blob/RELEASE-2.1.1/scripts/generic/multi-bleu.perl
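
The script is typically invoked with the reference file as an argument and the candidate sentences on standard input, e.g. `perl multi-bleu.perl reference.txt < hypothesis.txt` (file names here are placeholders).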

