Search if two lines are almost the same

I want to find out if the lines are similar. For example, a string like "Mohan Mehta" should match "Mohan Mehte" and vice versa. Another example: a line like "Umesh Gupta" should match "Umash Gupte".

Basically, one line is correct and the other is incorrect. All my lines are the names of people.

Any suggestions on how to achieve this.

The solution should not be 100% effective.

+7
python string regex
source share
5 answers

You can use difflib.sequencematcher if you want something from stdlib:

from difflib import SequenceMatcher s_1 = 'Mohan Mehta' s_2 = 'Mohan Mehte' print(SequenceMatcher(a=s_1,b=s_2).ratio()) 0.909090909091 

fuzzywuzzy is one of the many libraries you can install, it uses the difflib module with python-Levenshtein . You should also check wikipage for Approximate_string_matching

+8
source share

Another approach is to use the phonetic algorithm :

A phonetic algorithm is an algorithm for indexing words by their pronunciation.

For example, using the soundex algorithm:

 >>> import soundex >>> s = soundex.getInstance() >>> s.soundex("Umesh Gupta") 'U5213' >>> s.soundex("Umash Gupte") 'U5213' >>> s.soundex("Umesh Gupta") == s.soundex("Umash Gupte") True 
+6
source share

What you want is a distance bar . There are many tastes, but I would recommend starting with the Levenshtein distance .

+2
source share

you can see NLTK (Natural Language Toolkit), in particular nltk.metrics , which implements various string distance algorithms, including the already mentioned Levenshtein distance.

+2
source share
 // calculate the similarity between 2 strings public static double similarity(String s1, String s2) { String longer = s1, shorter = s2; if (s1.length() < s2.length()) { // longer should always have greater length longer = s2; shorter = s1; } int longerLength = longer.length(); if (longerLength == 0) { return 1.0; /* both strings are zero length */ } /* // If you have StringUtils, you can use it to calculate the edit distance: return (longerLength - StringUtils.getLevenshteinDistance(longer, shorter)) / (double) longerLength; */ return (longerLength - editDistance(longer, shorter)) / (double) longerLength; } // Example implementation of the Levenshtein Edit Distance // See http://rosettacode.org/wiki/Levenshtein_distance#Java public static int editDistance(String s1, String s2) { s1 = s1.toLowerCase(); s2 = s2.toLowerCase(); int[] costs = new int[s2.length() + 1]; for (int i = 0; i <= s1.length(); i++) { int lastValue = i; for (int j = 0; j <= s2.length(); j++) { if (i == 0) costs[j] = j; else { if (j > 0) { int newValue = costs[j - 1]; if (s1.charAt(i - 1) != s2.charAt(j - 1)) newValue = Math.min(Math.min(newValue, lastValue), costs[j]) + 1; costs[j - 1] = lastValue; lastValue = newValue; } } } if (i > 0) costs[s2.length()] = lastValue; } return costs[s2.length()]; } 
-2
source share

All Articles