How to handle duplicate characters in common strings when using the Jaro string similarity algorithm

Question

I am trying to work out the definition of a common string between two strings when applying the Jaro string similarity algorithm.

Let's say we have:

 s1 = 'profjohndoe'
 s2 = 'drjohndoe'

In Jaro similarity, the match window is floor(11/2) - 1 = 4, where 11 is the length of the longer string. As defined by the algorithm, s1[i] = s2[j] is considered common if abs(i-j) <= 4.

The match matrix is then:

  p r o f j o h n d o e
d 0 0 0 0 0 0 0 0 0 0 0
r 0 1 0 0 0 0 0 0 0 0 0
j 0 0 0 0 1 0 0 0 0 0 0
o 0 0 1 0 0 1 0 0 0 0 0
h 0 0 0 0 0 0 1 0 0 0 0
n 0 0 0 0 0 0 0 1 0 0 0
d 0 0 0 0 0 0 0 0 1 0 0
o 0 0 0 0 0 1 0 0 0 1 0
e 0 0 0 0 0 0 0 0 0 0 1
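
The matrix above can be reproduced with a short base R sketch. It assumes 1-based positions and that the window is taken from the longer string's length, as in floor(11/2) - 1; the variable names are illustrative only.

```r
s1 <- "profjohndoe"
s2 <- "drjohndoe"

a <- strsplit(s1, "")[[1]]  # characters of s1 (columns)
b <- strsplit(s2, "")[[1]]  # characters of s2 (rows)

# Match window: floor(longer length / 2) - 1, which is 4 here
w <- max(floor(max(length(a), length(b)) / 2) - 1, 0)

# m[j, i] is 1 when b[j] == a[i] and the two positions are at most w apart
same_char <- outer(b, a, "==")
in_window <- abs(outer(seq_along(b), seq_along(a), "-")) <= w
m <- (same_char & in_window) * 1L
dimnames(m) <- list(b, a)

print(m)
```

Rows or columns with more than one 1 (the two 'o' rows and the middle 'o' column) are exactly where the duplicate-character ambiguity comes from.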

So:

char_ins1_canfound_ins2 would be 'rojohndoe' (in the order the characters appear in s1);
char_ins2_canfound_ins1 would be 'rjohndoe' (in the order the characters appear in s2).
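
A minimal sketch of how these two strings can be derived, assuming the same window of 4. The function common_chars is a hypothetical helper written for this question, not part of stringdist, and it does not yet enforce one-to-one pairing, which is why the two results differ in length.

```r
# Hypothetical helper: characters of x that have an equal character in y
# within w positions, kept in the order they appear in x
common_chars <- function(x, y, w) {
  a <- strsplit(x, "")[[1]]
  b <- strsplit(y, "")[[1]]
  keep <- vapply(seq_along(a), function(i) {
    lo <- max(1, i - w)
    hi <- min(length(b), i + w)
    # An empty window (lo > hi) can never contain a match
    lo <= hi && any(b[lo:hi] == a[i])
  }, logical(1))
  paste(a[keep], collapse = "")
}

s1 <- "profjohndoe"
s2 <- "drjohndoe"
w <- floor(max(nchar(s1), nchar(s2)) / 2) - 1  # 4

common_chars(s1, s2, w)  # "rojohndoe" (9 characters)
common_chars(s2, s1, w)  # "rjohndoe"  (8 characters)
```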

Now I have a situation where the two common character strings are not of equal length. How should this be handled?

Applying the function "stringdist" from the package "stringdist" gives the following result:

> 1 - stringdist('profjohndoe','drjohndoe',method='jw')
[1] 0.7887205

which appears to have been computed as:

> 1/3 * (8/9 + 8/11 + (8-2) / 8)
[1] 0.7887205

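That is, stringdist seems to count 8 matching characters. If char_ins1_canfound_ins2 is reduced to "rojohnde", 6 characters are out of place and the similarity would be 1/3 * (8/9 + 8/11 + (8-3)/8). If it is reduced to a string with only 2 characters out of place, the similarity would be 1/3 * (8/9 + 8/11 + (8-1)/8). Neither of these equals the stringdist result.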
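
For comparison, here is a minimal sketch of the usual textbook formulation, in which each character of s1 is paired with the first unused equal character of s2 inside the window. The function jaro_greedy is a hypothetical helper and is not claimed to be what stringdist does internally; for this pair it reproduces the "rojohnde" case and the value 1/3 * (8/9 + 8/11 + (8-3)/8).

```r
# Hypothetical helper: Jaro similarity with greedy one-to-one matching
jaro_greedy <- function(x, y) {
  a <- strsplit(x, "")[[1]]
  b <- strsplit(y, "")[[1]]
  w <- max(floor(max(length(a), length(b)) / 2) - 1, 0)
  used_a <- logical(length(a))  # characters of x already paired
  used_b <- logical(length(b))  # characters of y already paired

  for (i in seq_along(a)) {
    lo <- max(1, i - w)
    hi <- min(length(b), i + w)
    if (lo > hi) next
    for (j in lo:hi) {
      # Pair a[i] with the first unused equal character inside the window
      if (!used_b[j] && a[i] == b[j]) {
        used_a[i] <- TRUE
        used_b[j] <- TRUE
        break
      }
    }
  }

  m <- sum(used_a)
  if (m == 0) {
    return(list(matches = 0, transpositions = 0, similarity = 0))
  }
  # Matched characters in each string's own order; both have length m
  ma <- a[used_a]
  mb <- b[used_b]
  n_trans <- sum(ma != mb) / 2  # half the number of out-of-place characters
  list(
    in_s1 = paste(ma, collapse = ""),
    in_s2 = paste(mb, collapse = ""),
    matches = m,
    transpositions = n_trans,
    similarity = (m / length(a) + m / length(b) + (m - n_trans) / m) / 3
  )
}

res <- jaro_greedy("profjohndoe", "drjohndoe")
res$in_s1           # "rojohnde"
res$in_s2           # "rjohndoe"
res$matches         # 8
res$transpositions  # 3
res$similarity      # about 0.7470539
```

Because each character can be paired only once, the second 'o' of s2 is taken by the middle 'o' of s1 and the last 'o' of s1 is left unmatched, so both matched strings end up with 8 characters.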

Does anyone know how R's stringdist handles this situation?

string algorithm r stringdist jaro-winkler

Haochuan Zhou 23 . '14 9:01
