Typical Jaro distance behavior in JellyFish

I am using jellyfish for fuzzy string matching, and I have noticed some strange behavior in its jaro_distance function.

I had a problem earlier with the damerau_levenshtein_distance algorithm, which turned out to be a bug in the code; a Stack Overflow user then reported it as an issue on GitHub.

I am not sure whether I have misunderstood the measure or whether this is a real bug. I looked at the source code ( http://goo.gl/YVMl8k ), but I am not familiar with C, so it's hard for me to tell whether there is an implementation problem or I'm just wrong.

Observe the following (with jellyfish imported as jf):

In [1]: S1 = 'Poverty'
In [2]: S2 = 'Poervty'
In [3]: jf.jaro_distance(S1, S2)
Out[3]: 0.95238095

Now, if my understanding of the Jaro distance measure is correct, I believe the result should be 0.9285714285.

I worked out where the calculation diverges. To calculate the measure, I believe the following holds:

(7.0/7.0 + 7.0/7.0 + ((7.0 - (3.0/2.0))/7.0)) * (1.0/3.0) = 0.9285714285

The critical number in this expression is 3.0. It should represent the "number of matching (but different sequence order) characters" (Wikipedia). In my view, the characters in S1 and S2 that match but appear in a different sequence order are "e", "r", and "v".
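To make that count concrete, here is a minimal pure-Python sketch of the matching step as I understand it from the Wikipedia description. The function name count_matches is hypothetical, and this is not jellyfish's implementation, only an illustration of how I arrive at 3.

```python
def count_matches(s1, s2):
    """Return (matches, out_of_order) as used by the Jaro measure.

    Hypothetical helper for illustration only, not jellyfish's code.
    """
    # Two characters match only if they are equal and no farther apart
    # than this window.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)

    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                break

    # Compare the matched characters of both strings in their own order;
    # every position where they differ is one out-of-order match.
    matched1 = [c for c, used in zip(s1, used1) if used]
    matched2 = [c for c, used in zip(s2, used2) if used]
    out_of_order = sum(a != b for a, b in zip(matched1, matched2))
    return len(matched1), out_of_order


print(count_matches("Poverty", "Poervty"))  # (7, 3)
```

With this reading, all 7 characters match and 3 of them ("v", "e", "r") sit at different positions in the two matched sequences.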

However, jellyfish seems to count only two of them in its calculation:

(7.0/7.0 + 7.0/7.0 + ((7.0 - (2.0/2.0))/7.0)) * (1.0/3.0) = 0.95238095

Am I misunderstanding the measure, or is there a bug in the function?

1 answer

In jaro.c, the transposition counter, trans_count, is declared as a long. As a result, this line:

trans_count /= 2;

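In C, dividing a long by 2 is integer division, which discards the remainder. So in your example (POVERTY/POERVTY), a count of 3 is halved to 1, the same result a count of 2 would give.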
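The effect of the `trans_count /= 2;` line can be sketched in Python, where `//` discards the remainder the way C integer division does for positive values. The function jaro_from_counts and its truncate flag are hypothetical names for illustration; the counts (7 matches, 3 out-of-order characters, both lengths 7) are taken from the question.

```python
def jaro_from_counts(len1, len2, matches, out_of_order, truncate):
    """Jaro score from precomputed counts.

    Hypothetical helper: `truncate=True` mimics halving an integer
    counter in C, `truncate=False` keeps the fractional half.
    """
    if matches == 0:
        return 0.0
    # Python's // drops the remainder, like C's /= on a positive long.
    t = out_of_order // 2 if truncate else out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


# Truncated halving: 3 // 2 == 1, the value jellyfish returned.
print(round(jaro_from_counts(7, 7, 7, 3, truncate=True), 8))   # 0.95238095
# Exact halving: 3 / 2 == 1.5, the value expected in the question.
print(round(jaro_from_counts(7, 7, 7, 3, truncate=False), 8))  # 0.92857143
```

So the two numbers in the question differ only in whether the half-count keeps its fractional part.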

? , :

  • , . ( MARTHA-MARHTA 0,944, - - 0,961.)

  • Jaro 1989 paper .

  • Winkler 1990 . , , :

    , .

    , . , , , . , J-W MARTHA-MARHTA 0,9667 (. 1), , , . . , ?

  • 1995 " ( , " , " ), , N_trans, long, , .

    ( MARTHA-MARHTA 0,9708 - " ".)

, . , .
