Detecting String Similarities Using Levenshtein Distance (SLOW)

The query returns 21 million records

Iterating over this table takes forever. What other solutions exist?

    SqlDataReader rd = DbInfo.DataRdr(Conn,
        "SELECT a.NAME AS ANAME, b.NAME AS BNAME, a.ID AS AID, b.ID AS BID " +
        "FROM myTable a JOIN myTable b ON a.NUM = b.NUM AND a.ID <> b.ID");

    while (rd.Read())
    {
        if (rd["ANAME"].ToString().LevenshteinDistance(rd["BNAME"].ToString()) <= 10)
        {
            Logging.Write(...);
        }
    }

    public static int LevenshteinDistance(this string s, string t)
    {
        if (s == null) throw new ArgumentNullException("s");
        if (t == null) throw new ArgumentNullException("t");

        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        if (n == 0 || m == 0)
            return Math.Max(m, n);

        for (int i = 0; i <= n; i++) { d[i, 0] = i; }
        for (int i = 0; i <= m; i++) { d[0, i] = i; }

        for (int i = 0; i < n; i++)
        {
            for (int j = 0; j < m; j++)
            {
                int cost = (t[j] == s[i]) ? 0 : 1;
                d[i + 1, j + 1] = Math.Min(Math.Min(d[i, j + 1] + 1, d[i + 1, j] + 1), d[i, j] + cost);
            }
        }
        return d[n, m];
    }
5 answers

You can start by using the following as your query, depending on how often the NUM columns are actually equal:

    SELECT a.NAME AS ANAME, b.NAME AS BNAME, other things
    FROM myTable a
    JOIN myTable b ON a.NUM = b.NUM AND a.ID < b.ID

Then the SQL query will give you only the NUM-matching pairs of rows, on which you can call LevenshteinDistance. Something like:

    DataTable dt1 = new DataTable();
    dt1.Load(DbInfo.DataRdr(Conn, "[the query I wrote above]"));

    for (int i = 0; i < dt1.Rows.Count; i++)
    {
        if (dt1.Rows[i]["ANAME"].ToString().LevenshteinDistance(dt1.Rows[i]["BNAME"].ToString()) <= 10)
        {
            Logging.Write(...[modify the query so it returns the things you want to log]...);
        }
    }

You compare dt1.Rows[i]["Name"].ToString() with dt1.Rows[j]["Name"].ToString(), even when i == j.

Try 0 <= i < dt1.Rows.Count - 1 and i + 1 <= j < dt1.Rows.Count .
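A minimal, self-contained sketch of those loop bounds, using a plain string array in place of the DataTable (names here are illustrative):

```csharp
using System;
using System.Collections.Generic;

class PairLoopDemo
{
    // With these bounds every unordered pair appears exactly once
    // and i == j never occurs.
    public static List<string> Pairs(string[] names)
    {
        var pairs = new List<string>();
        for (int i = 0; i < names.Length - 1; i++)      // 0 <= i < Count - 1
            for (int j = i + 1; j < names.Length; j++)  // i + 1 <= j < Count
                pairs.Add(names[i] + "-" + names[j]);
        return pairs;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", Pairs(new[] { "Alice", "Bob", "Carol" })));
        // Alice-Bob,Alice-Carol,Bob-Carol  (n * (n - 1) / 2 = 3 pairs)
    }
}
```

With the original `i <> j` bounds each pair would be visited twice and each row once against itself; this shape does half the work and skips the self-comparisons entirely.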

In addition, you should only log if dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString(), which is probably a faster check. There is no point doing Levenshtein when it is false.

EDIT: @John is right about dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString() using i in both places; the second index should of course be j.


Optimization:

1) Take a look at your data. Perhaps a few cheap checks can rule out invalid pairs faster. Since name lengths can differ by more than 10, you can check whether the difference between s.Length and t.Length is greater than 10 and immediately return a large distance (maybe int.MaxValue, or just 100). There is no point computing the distance when it clearly exceeds the threshold.
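That early exit can be sketched as a standalone guard (my own sketch; the method and threshold names are illustrative). It works because the length difference is a lower bound on the Levenshtein distance:

```csharp
using System;

class LengthGuardDemo
{
    // The length difference is a lower bound on the edit distance, so a
    // pair whose lengths differ by more than the threshold can be
    // rejected without allocating the (n+1) x (m+1) matrix at all.
    public static bool CouldBeWithin(string s, string t, int maxDistance)
        => Math.Abs(s.Length - t.Length) <= maxDistance;

    static void Main()
    {
        Console.WriteLine(CouldBeWithin("short", "a much, much longer name", 10)); // False
        Console.WriteLine(CouldBeWithin("Jonathan", "Johnathan", 10));             // True
    }
}
```

Only pairs that pass this guard need the full LevenshteinDistance call.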

2) Look for small optimizations. Looping twice over 150,000 rows means 22.5 billion iterations, so small changes can have a big effect. You can cache the row objects and drop the ToString() calls by using Equals() instead. I think this will be faster than indexing into your data table 150,000 times.

    for (int i = 0; i < dt1.Rows.Count - 1; i++)
    {
        var outerRow = dt1.Rows[i];
        for (int j = i + 1; j < dt1.Rows.Count; j++)
        {
            var innerRow = dt1.Rows[j];
            if (Equals(outerRow["NUM"], innerRow["NUM"]))
            {
                if (outerRow["Name"].ToString().LevenshteinDistance(innerRow["Name"].ToString()) <= 10)
                {
                    Logging.Write(...);
                }
            }
        }
    }

3) Try to reduce or partition the data set. Run a query to get all possible NUM values: select distinct NUM from myTable. Then, for each NUM in the result, execute your original query with a where clause and select only the name: SELECT name FROM myTable WHERE NUM = currentNum.

That way you don't have to compare NUM values in code, and you don't fetch data you don't need. Your code can then be reduced to just the Levenshtein distance computation, still using the optimizations from 1) and 2).
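The same batching idea can be sketched in memory with LINQ's GroupBy, assuming the rows have already been loaded (the row shape and names here are illustrative):

```csharp
using System;
using System.Linq;

class GroupByNumDemo
{
    // Group by NUM first; comparisons then happen only inside a group,
    // so the inner loop needs no NUM check at all.
    public static int CountComparisons((int Num, string Name)[] rows)
    {
        int comparisons = 0;
        foreach (var group in rows.GroupBy(r => r.Num))
        {
            var names = group.Select(r => r.Name).ToArray();
            for (int i = 0; i < names.Length - 1; i++)
                for (int j = i + 1; j < names.Length; j++)
                    comparisons++; // LevenshteinDistance(names[i], names[j]) would go here
        }
        return comparisons;
    }

    static void Main()
    {
        var rows = new[] { (1, "Smith"), (1, "Smyth"), (2, "Jones") };
        Console.WriteLine(CountComparisons(rows)); // only the Smith/Smyth pair: 1
    }
}
```

Grouping replaces a quadratic scan over the whole table with one quadratic scan per NUM group, which is far smaller when NUM values are well distributed.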

4) Try a different approach, such as full-text search.

I just had to solve a similar problem: finding matches in a table of 1.2 million rows. I used lucene.net, which gives me real-time results when searching on one or more properties of my strings.

It also supports Levenshtein-based (fuzzy) matching, and is probably faster than your implementation ;) MSSQL Server supports full-text search as well.


The biggest improvement you can make is to reduce the solution space under consideration. Since you want a maximum distance of 10, any pair of strings whose lengths differ by more than 10 cannot qualify:

    SELECT a.NAME AS ANAME, b.NAME AS BNAME, a.ID AS AID, b.ID AS BID
    FROM myTable a
    JOIN myTable b ON a.NUM = b.NUM AND a.ID < b.ID
    WHERE length(a.NAME) - length(b.NAME) BETWEEN -10 AND 10;

Then profile your code and see where the hot spots are. A good article: Find application bottlenecks with Visual Studio Profiler.

And look at Optimizing the Levenshtein Algorithm in C # .

Edit

Chris also noticed that since levenshtein(a, b) == levenshtein(b, a), you only need to select a.ID < b.ID in the join; the matrix is symmetric. That halves your problem right away.


After performing the other optimizations mentioned in this thread, you could move the Levenshtein calculation to the server and SELECT only the rows matching your distance criterion. I needed this functionality in a project, so I made a library out of it, here. The edit-distance method used in the lib needs only n * 2 memory instead of n * m.
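A minimal sketch of such a two-row edit-distance implementation (my own sketch, not necessarily the library's exact code): instead of the full (n+1) x (m+1) matrix, it keeps only the previous and current rows.

```csharp
using System;

static class TwoRowLevenshtein
{
    // Two-row Levenshtein: memory is O(2 * m) instead of O(n * m),
    // because row i + 1 of the matrix depends only on row i.
    public static int Distance(string s, string t)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        if (t == null) throw new ArgumentNullException(nameof(t));
        if (s.Length == 0 || t.Length == 0) return Math.Max(s.Length, t.Length);

        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;

        for (int i = 0; i < s.Length; i++)
        {
            curr[0] = i + 1;
            for (int j = 0; j < t.Length; j++)
            {
                int cost = (s[i] == t[j]) ? 0 : 1;
                curr[j + 1] = Math.Min(Math.Min(prev[j + 1] + 1, curr[j] + 1), prev[j] + cost);
            }
            (prev, curr) = (curr, prev); // reuse the old row as the next current row
        }
        return prev[t.Length];
    }

    static void Main()
    {
        Console.WriteLine(Distance("kitten", "sitting")); // 3
        Console.WriteLine(Distance("flaw", "lawn"));      // 2
    }
}
```

It returns the same values as the full-matrix version in the question, just without the large allocation per call.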

Even on the server, you only want to perform the EditDistance calculation when the difference in string length is < 10, so check that first. Something like:

    SELECT a.NAME AS NameA, b.NAME AS NameB
    FROM table a
    JOIN table b ON a.NUM = b.NUM
    WHERE a.Id < b.Id
      AND length(a.NAME) - length(b.NAME) BETWEEN -10 AND 10
      AND EditDistance(a.Name, b.Name) < 10
