Speeding up levenshtein / Similar_text in PHP

Question

Speeding up levenshtein / Similar_text in PHP

I am currently using Similar_text to compare a string with a list of ~ 50,000, which works, although due to the number of comparisons it is very slow. It takes about 11 minutes to compare ~ 500 unique strings.

Before starting this, I will check the databases to see if this has been handled in the past, so every time after the initial start it is close to instant.

I am sure that using levenshtein will be a little faster, and the LevenshteinDistance feature published in the manual looks interesting. Did I miss something that could make it significantly faster?

+6

php levenshtein distance similarity

Dancake Aug 1 '09 at 2:56

source share

3 answers

When using the levenshtein automaton (an automaton that matches a line with distance k ), you can perform a conformance check in O(n) , where n is the length of the line you are checking. Building an automaton takes O(kn) , where k is the maximum distance and length n base line.

+2

Egon Oct 23 '09 at 8:31

source share

Perhaps you could short-circuit some checks by first comparing your string for an exact match (and first compare if the length is the same) and if it misses the more expensive similar_text call.

As Jason noted, the O (N ^ 3) algorithm will never be a good choice.

+1

Mitch wheat Aug 1 '09 at 3:29

source share

Dancake · Accepted Answer · 2009-08-05T05:24:59+0000

In the end, both levenshtein and similar_text were too slow with the number of lines it had to go through, even with a lot of checks, and only using one of them as a last resort.

As an experiment, I ported part of the code to C # to see how much faster it would be overflowing code. It worked after about 3 minutes with the same dataset.

Then I added an extra field to the table and used the PECL double metaphone extension to generate keys for each row. The results were good, though, as some of the numbers included caused duplicates. I think I could run each of these functions, but decided not to.

In the end, I chose the simplest approach, the full MySQL text, which worked very well. Sometimes there are errors, although they are easy to detect and correct. It also works very fast, after about 3-4 seconds.

Speeding up levenshtein / Similar_text in PHP

More articles: