I am trying to work out how common characters between two strings are defined when applying the Jaro string similarity algorithm.
Let's say that
s1 = 'profjohndoe'
s2 = 'drjohndoe'
In Jaro similarity, the matching window is floor(11/2) - 1 = 4, so, as defined by the algorithm, s1[i] = s2[j] is considered common if abs(i-j) <= 4.
The matching matrix is then:
  p r o f j o h n d o e
d 0 0 0 0 0 0 0 0 0 0 0
r 0 1 0 0 0 0 0 0 0 0 0
j 0 0 0 0 1 0 0 0 0 0 0
o 0 0 1 0 0 1 0 0 0 0 0
h 0 0 0 0 0 0 1 0 0 0 0
n 0 0 0 0 0 0 0 1 0 0 0
d 0 0 0 0 0 0 0 0 1 0 0
o 0 0 0 0 0 1 0 0 0 1 0
e 0 0 0 0 0 0 0 0 0 0 1
So:
char_ins1_canfound_ins2 would be 'rojohndoe' (in their presented order in s1);
char_ins2_canfound_ins1 would be 'rjohndoe' (in their presented order in s2).
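For reference, the matrix and the two strings above can be reproduced with a short base R sketch. The names `window` and `match_matrix` are only illustrative and do not come from any package; the window is computed as floor(longest length / 2) - 1, as in the calculation above.

```r
s1 <- strsplit("profjohndoe", "")[[1]]
s2 <- strsplit("drjohndoe", "")[[1]]

# Matching window: floor(11 / 2) - 1 = 4 for these two strings
window <- floor(max(length(s1), length(s2)) / 2) - 1

# Rows follow s2 and columns follow s1, as in the matrix above
same_char <- outer(s2, s1, "==")
in_window <- abs(outer(seq_along(s2), seq_along(s1), "-")) <= window
match_matrix <- (same_char & in_window) * 1L
dimnames(match_matrix) <- list(s2, s1)
print(match_matrix)

# Characters that have at least one candidate partner in the other string
char_ins1_canfound_ins2 <- paste(s1[colSums(match_matrix) > 0], collapse = "")
char_ins2_canfound_ins1 <- paste(s2[rowSums(match_matrix) > 0], collapse = "")
```

Note that this only marks candidate pairs; a single character of s2 can appear as a candidate for several characters of s1, which is why the two strings end up with different lengths.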
Now I have a case where the two common-character strings are not of equal length. How should this be handled?
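One common textbook formulation of Jaro avoids the unequal lengths by pairing each character at most once: scan s1 from left to right and pair each character with the first still-unused equal character of s2 inside the window. The sketch below follows that formulation; `jaro_greedy` is a hypothetical helper written for this question, and it is not claimed to match what any particular package does internally.

```r
jaro_greedy <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  window <- max(floor(max(length(x), length(y)) / 2) - 1, 0)
  x_used <- rep(FALSE, length(x))
  y_used <- rep(FALSE, length(y))

  # Pair each character of a with the first unused equal character of b
  # inside the window, so that no character is counted twice.
  for (i in seq_along(x)) {
    lo <- max(1, i - window)
    hi <- min(length(y), i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!y_used[j] && x[i] == y[j]) {
        x_used[i] <- TRUE
        y_used[j] <- TRUE
        break
      }
    }
  }

  mx <- x[x_used]  # matched characters in s1 order
  my <- y[y_used]  # matched characters in s2 order
  m <- length(mx)
  # Transpositions: half the number of positions where the two sequences differ
  t <- if (m > 0) sum(mx != my) / 2 else 0
  sim <- if (m > 0) (m / length(x) + m / length(y) + (m - t) / m) / 3 else 0
  list(common_a = paste(mx, collapse = ""),
       common_b = paste(my, collapse = ""),
       matches = m, transpositions = t, similarity = sim)
}

res <- jaro_greedy("profjohndoe", "drjohndoe")
str(res)
```

With this one-to-one pairing both common strings have 8 characters ("rojohnde" and "rjohndoe"), because the last "o" of s1 finds no unused "o" left in s2.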
Applying the function "stringdist" from the package "stringdist" gives the following result:
> 1 - stringdist('profjohndoe','drjohndoe',method='jw')
[1] 0.7887205
which appears to correspond to this:
> 1/3 * (8/9 + 8/11 + (8-2) / 8)
[1] 0.7887205
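So stringdist seems to use 8 matched characters. However, if char_ins1_canfound_ins2 is taken as "rojohnde", 6 characters are out of order (3 transpositions), which gives 1/3 * (8/9 + 8/11 + (8-3)/8). If char_ins1_canfound_ins2 is taken as "", 2 characters are out of order (1 transposition), which gives 1/3 * (8/9 + 8/11 + (8-1)/8).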
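To make the comparison concrete, the three expressions in this question can be evaluated side by side. This is plain arithmetic in base R; `jaro_for` is a hypothetical helper and `t` stands for the value subtracted from the 8 matches in the last term.

```r
m <- 8      # matched characters
len1 <- 11  # length of 'profjohndoe'
len2 <- 9   # length of 'drjohndoe'

# Jaro value for a given t, written exactly as in the formulas above
jaro_for <- function(t) (m / len2 + m / len1 + (m - t) / m) / 3

candidates <- sapply(c(t1 = 1, t2 = 2, t3 = 3), jaro_for)
print(round(candidates, 7))
```

Of the three, only t = 2 gives 0.7887205; t = 3 gives about 0.7470539 and t = 1 gives about 0.8303872.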
How does R's stringdist arrive at its result?
Thanks!