How to handle duplicate characters in the common strings when using the Jaro string similarity algorithm

I am trying to work out how the common characters between two strings are defined when applying the Jaro string similarity algorithm.

Let's say that

 s1 = 'profjohndoe'
 s2 = 'drjohndoe'

In Jaro similarity, the match window determined by the algorithm is floor(11/2) - 1 = 4, so s1[i] = s2[j] is considered common if abs(i-j) <= 4.

The match matrix is then:

  p r o f j o h n d o e
d 0 0 0 0 0 0 0 0 0 0 0
r 0 1 0 0 0 0 0 0 0 0 0
j 0 0 0 0 1 0 0 0 0 0 0
o 0 0 1 0 0 1 0 0 0 0 0
h 0 0 0 0 0 0 1 0 0 0 0
n 0 0 0 0 0 0 0 1 0 0 0
d 0 0 0 0 0 0 0 0 1 0 0
o 0 0 0 0 0 1 0 0 0 1 0
e 0 0 0 0 0 0 0 0 0 0 1
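
The matrix above can be reproduced with a short sketch. It is written in plain Python rather than R, uses 0-based indices, and assumes the window is computed from the longer string, as in floor(11/2) - 1. The function name match_matrix is made up for illustration.

```python
def match_matrix(s1, s2):
    """Return a 0/1 matrix: rows follow s2, columns follow s1."""
    # Assumption: window = floor(max length / 2) - 1, as in the question.
    window = max(len(s1), len(s2)) // 2 - 1
    return [
        [1 if a == b and abs(i - j) <= window else 0
         for i, a in enumerate(s1)]
        for j, b in enumerate(s2)
    ]


s1, s2 = "profjohndoe", "drjohndoe"
matrix = match_matrix(s1, s2)

# Print in the same layout as the table: header row is s1, row labels are s2.
print("  " + " ".join(s1))
for ch, row in zip(s2, matrix):
    print(ch, " ".join(str(v) for v in row))
```

The two 'o' rows each contain two 1s, which is exactly where the ambiguity about duplicate characters comes from.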

so:

char_ins1_canfound_ins2 would be 'rojohndoe' (in their presented order in s1);
char_ins2_canfound_ins1 would be 'rjohndoe' (in their presented order in s2).
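
A minimal sketch of how these two strings come about, assuming a character counts as "can be found" whenever it has at least one equal character within the window in the other string, with no one-to-one restriction. The helper name candidates is hypothetical.

```python
def candidates(a, b):
    """Characters of a (in order) with at least one in-window match in b."""
    # Assumption: same window as in the question, floor(max length / 2) - 1.
    window = max(len(a), len(b)) // 2 - 1
    return "".join(
        ch
        for i, ch in enumerate(a)
        if any(ch == other and abs(i - j) <= window
               for j, other in enumerate(b))
    )


s1, s2 = "profjohndoe", "drjohndoe"
print(candidates(s1, s2))  # rojohndoe
print(candidates(s2, s1))  # rjohndoe
```

Because nothing stops one character in s2 from being counted against several characters in s1, the two results can differ in length (9 versus 8 here).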

Now I have a case where the two common-character strings are not of equal length. How should this be handled?

Applying the function "stringdist" from the package "stringdist" gives the following result:

> 1 - stringdist('profjohndoe','drjohndoe',method='jw')
[1] 0.7887205

which is the same as:

> 1/3 * (8/9 + 8/11 + (8-2) / 8)
[1] 0.7887205

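So it seems stringdist uses 8 as the number of common characters. But if char_ins1_canfound_ins2 is taken as "rojohnde", 6 characters are out of place, giving 1/3 * (8/9 + 8/11 + (8-3)/8); and if char_ins1_canfound_ins2 is taken as "", 2 are out of place, giving 1/3 * (8/9 + 8/11 + (8-1)/8).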
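
For comparison, here is a sketch of the greedy one-to-one matching commonly used in Jaro implementations: scan s1 from left to right and pair each character with the first unused equal character of s2 inside the window. This is an assumption about the usual textbook procedure, not a description of what stringdist does internally, and the name jaro_greedy is made up.

```python
def jaro_greedy(s1, s2):
    """Return (score, common1, common2) using greedy one-to-one matching."""
    if not s1 or not s2:
        return 0.0, "", ""
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            # Each character of s2 may be matched at most once.
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                break
    common1 = "".join(c for c, u in zip(s1, used1) if u)
    common2 = "".join(c for c, u in zip(s2, used2) if u)
    m = len(common1)
    if m == 0:
        return 0.0, common1, common2
    # Transpositions: half the number of positions where the two
    # common strings disagree.
    t = sum(a != b for a, b in zip(common1, common2)) // 2
    score = (m / len(s1) + m / len(s2) + (m - t) / m) / 3
    return score, common1, common2


score, common1, common2 = jaro_greedy("profjohndoe", "drjohndoe")
print(common1, common2)   # rojohnde rjohndoe
print(round(score, 7))    # 0.7470539
```

Under this rule both common strings always have the same length (8 here), because every match uses up exactly one character on each side. The result corresponds to the (8-3)/8 case, so it does not reproduce the 0.7887205 returned by stringdist.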

How does R's stringdist handle this?
