Optimization:
1) Take a look at your data. You can probably add a few cheap checks that reject invalid pairs faster. For example, if two names whose lengths differ by more than 10 can never match, compare s.Length and t.Length first and immediately return a large distance (int.MaxValue, or simply 100) when the difference exceeds 10. There is no point computing the full distance when it is clearly over the threshold.
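A minimal sketch of that guard, reusing the LevenshteinDistance extension method that appears in the code below (IsWithinDistance is a hypothetical helper name, not an existing API):

using System;

public static class StringDistance
{
    // The length difference alone is a lower bound on the Levenshtein
    // distance, so pairs failing this check are rejected almost for free.
    public static bool IsWithinDistance(string s, string t, int threshold)
    {
        if (Math.Abs(s.Length - t.Length) > threshold)
            return false;

        return s.LevenshteinDistance(t) <= threshold;
    }
}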
2) Look for small optimizations. A nested loop over 150,000 rows means roughly 22.5 billion iterations, so even small changes have a big effect. You can cache the string values of the rows instead of calling ToString() repeatedly, and compare the NUM values with Equals(). That should be faster than pulling the i-th element out of your DataTable 150,000 times. A cleaned-up version of the loop (with the inner loop bounds and the NUM comparison fixed, and the Name string cached per outer row):
for (int i = 0; i < dt1.Rows.Count; i++)
{
    var outerRow = dt1.Rows[i];
    var outerName = outerRow["Name"].ToString(); // cache once per outer row

    for (int j = i + 1; j < dt1.Rows.Count; j++)
    {
        var innerRow = dt1.Rows[j];

        if (Equals(outerRow["NUM"], innerRow["NUM"]))
        {
            if (outerName.LevenshteinDistance(innerRow["Name"].ToString()) <= 10)
            {
                Logging.Write(...);
            }
        }
    }
}
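Going a step further, you could copy the NUM and Name values into plain arrays once before the loops, so the inner loop never touches the DataTable at all. A sketch of that idea, with the length guard from point 1 folded in (dt1, NUM, Name and Logging.Write as above):

int n = dt1.Rows.Count;
var nums = new string[n];
var names = new string[n];

// One pass over the DataTable instead of billions of indexer calls.
for (int i = 0; i < n; i++)
{
    nums[i] = dt1.Rows[i]["NUM"].ToString();
    names[i] = dt1.Rows[i]["Name"].ToString();
}

for (int i = 0; i < n; i++)
{
    for (int j = i + 1; j < n; j++)
    {
        if (nums[i] != nums[j])
            continue;

        // Length guard from point 1: skip pairs that cannot be within 10 edits.
        if (Math.Abs(names[i].Length - names[j].Length) > 10)
            continue;

        if (names[i].LevenshteinDistance(names[j]) <= 10)
        {
            Logging.Write(...);
        }
    }
}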
3) Try to reduce / break up the data set. Run a query to get all distinct NUM values: select distinct NUM from myTable. Then, for each NUM in the result, execute your original query with a where clause and select only the name: SELECT Name from myTable where NUM = currentNum.
That way you no longer have to compare the NUM column at all, and you never fetch rows you don't need. Your loop is reduced to the Levenshtein distance alone, and the optimizations from points 1 and 2 still apply; see the sketch below.
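A minimal sketch of the partitioning idea. GetDistinctNums() and GetNamesForNum() are hypothetical helpers standing in for the two queries above, not an existing API:

// GetDistinctNums() runs: select distinct NUM from myTable
// GetNamesForNum(num) runs: SELECT Name from myTable where NUM = @num
foreach (string num in GetDistinctNums())
{
    List<string> names = GetNamesForNum(num);

    // Only names sharing the same NUM are ever compared.
    for (int i = 0; i < names.Count; i++)
    {
        for (int j = i + 1; j < names.Count; j++)
        {
            // Length guard from point 1.
            if (Math.Abs(names[i].Length - names[j].Length) > 10)
                continue;

            if (names[i].LevenshteinDistance(names[j]) <= 10)
            {
                Logging.Write(...);
            }
        }
    }
}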
4) Try a different approach, such as full-text search.
I just had to solve a similar problem, finding matches in a table of 1.2 million rows. I used lucene.net, which gives me real-time results when searching on one or more properties of my objects.
Lucene's fuzzy queries are also based on Levenshtein distance, and its implementation may well be faster than yours ;) MSSQL Server also supports full-text search.
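A rough sketch of what the Lucene route could look like, assuming the Lucene.NET 4.8 API (the API differs noticeably between versions, and note that FuzzyQuery in Lucene 4+ caps the edit distance at 2, so it cannot replicate a threshold of 10 exactly):

using System.Data;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

var dir = new RAMDirectory();
var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);

// Index every row once; StringField keeps each value as a single token.
using (var writer = new IndexWriter(dir, config))
{
    foreach (DataRow row in dt1.Rows)
    {
        var doc = new Document();
        doc.Add(new StringField("NUM", row["NUM"].ToString(), Field.Store.YES));
        doc.Add(new StringField("Name", row["Name"].ToString(), Field.Store.YES));
        writer.AddDocument(doc);
    }
}

// Fuzzy search: finds names within 2 edits of the query term.
using (var reader = DirectoryReader.Open(dir))
{
    var searcher = new IndexSearcher(reader);
    var query = new FuzzyQuery(new Term("Name", "some name"), 2);

    foreach (var hit in searcher.Search(query, 100).ScoreDocs)
    {
        Logging.Write(searcher.Doc(hit.Doc).Get("Name"));
    }
}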