I am trying to use jellyfish for fuzzy string matching, and I noticed some strange behavior in the jaro_distance algorithm.
I had some problems earlier with the damerau_levenshtein_distance algorithm, which turned out to be a bug in the code; a Stack Overflow user then reported it as an issue on GitHub.
I am not sure whether I misunderstand the measure or whether this is a real bug. I looked at the source code ( http://goo.gl/YVMl8k ), but I am not familiar with C, so it's hard for me to tell whether there is an implementation problem or I'm just wrong.
Observe the following (jellyfish is imported as jf):
In [1]: S1 = "Poverty"
In [2]: S2 = "Poervty"
In [3]: jf.jaro_distance(S1, S2)
Out[3]: 0.95238095
Now, if my understanding of the Jaro distance is correct, I believe the result should be 0.9285714285.
I worked out where the two calculations diverge. To compute the measure, I believe the following is correct:
(7.0/7.0 + 7.0/7.0 + (7.0 - (3.0/2.0))/7.0) * (1.0/3.0) = 0.9285714285
The critical number in this expression is 3.0. It should represent the "number of matching (but different sequence order) characters" (Wikipedia). In my opinion, the characters in S1 and S2 that match but appear in a different sequence order are "e", "r", and "v".
However, jellyfish seems to count only two such characters, which gives:
(7.0/7.0 + 7.0/7.0 + (7.0 - (2.0/2.0))/7.0) * (1.0/3.0) = 0.95238095
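To make the comparison concrete, here is a minimal pure-Python sketch of the measure as I read it from the Wikipedia definition. The function name jaro_sketch and the floor_half flag are hypothetical names of my own, not part of jellyfish. The flag only illustrates that halving the three out-of-order characters with and without rounding down reproduces the two numbers above; it is an assumption for illustration, not a statement about what the C code actually does.

```python
def jaro_sketch(s1, s2, floor_half=False):
    """Jaro similarity per the Wikipedia definition (illustrative sketch only)."""
    if not s1 or not s2:
        return 0.0

    # Two equal characters count as a match only if their positions
    # are no farther apart than this window.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)

    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                break

    m = sum(used1)  # number of matching characters
    if m == 0:
        return 0.0

    # Matched characters of each string, in their original order.
    matched1 = [c for c, u in zip(s1, used1) if u]
    matched2 = [c for c, u in zip(s2, used2) if u]

    # Matching characters that sit at a different place in the sequence.
    out_of_order = sum(a != b for a, b in zip(matched1, matched2))

    # Hypothetical switch: halve with or without rounding down.
    t = out_of_order // 2 if floor_half else out_of_order / 2

    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


print(round(jaro_sketch("Poverty", "Poervty"), 10))                   # 0.9285714286
print(round(jaro_sketch("Poverty", "Poervty", floor_half=True), 10))  # 0.9523809524
```

With this sketch, "e", "r", and "v" are the three out-of-order matches; using 3/2 = 1.5 gives my expected value, while rounding the half down to 1 gives the value jellyfish returns.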
Am I wrong about this, or is there a bug in the function?