Compare 5000 Lines with PHP Levenshtein

Question

Compare 5000 Lines with PHP Levenshtein

I have 5000, sometimes more lines of street addresses in the array. I would like to compare them all with levenshtein to find similar matches. How can I do this without looping all 5000 and comparing them directly with all the other 4999s?

Edit: I am also interested in alternative methods if anyone has suggestions. The overall goal is to find similar entries (and eliminate duplicates) based on the addresses sent by the user.

+7

database php levenshtein distance similarity street-address

phirschybar Dec 24 '09 at 11:30

source share

9 answers

I think you cannot escape a loop through an array, since the levenstein () function only accepts strings, not an array, as input.

You can do something like:

 for($i=0;$i<count($array)-1;$i++) { for($j=$i+1;$j<count($array);$j++) { $lev = levenshtein($array[$i],$array[$j]); if($lev == 0) { // exact match } else if($lev <= THRESHOLD) { // similar } } }

+3

codaddict Dec 24 '09 at 11:41

source share

You can use bk-tree to speed up the search / comparison.

http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees says:

Now we can make a particularly useful remark about the Levenshtein distance: it forms a metric space.
[...]
Suppose that at the moment we have two parameters: the query, the string that we use in our search, and n is the maximum distance that the string can be from the query and still be returned. Let's say we take an arbin line, check and compare it with the request. Call the resulting distance d. Since we know that there is a triangle inequality, all our results should have the largest distance d + n and at least the distance dn from the test.
[...]
Tests show that a search at a distance of 1 query does not exceed 5-8% of the tree, and a search with two errors is no more than 17-25% of the tree - a significant improvement over each node!

edit: But this does not help you with your problem (12 Bird Road, Apt 6 "and" 12 Bird Rd. # 6 "). Only when comparing brute force m * n.

+3

Volkerk Dec 24 '09 at 11:51

source share

Due to the nature of the Levenshtein algorithm (in particular, the fact that this is a comparison between two lines), I do not see how this is possible.

Of course, you could reduce the number of comparisons by fulfilling some basic compliance requirements, but that goes beyond what you ask.

As a (possibly irrelevant) option, you can always use something like soundex , which allows you to pre-compute string values. (You can also use it directly in MySQL, I believe.)

+2

John parker Dec 24 '09 at 11:42

source share

You can group them based on soundexes and then limit the comparison to the nearest N cases ...

  $mashed=array(); foreach ($address as $key=>$val) { $mashed[$key]=soundex($val); } sort($mashed);

Then iterate over the keys from $ mashed.

FROM.

+2

symcbean Dec 24 '09 at 11:42

source share

Given your problem, I see no other way to compare each address with any other address if you want to use Lehvenstein distance .

First of all, you must normalize addressing, get rid of abbreviations, etc.

Ave → Avenue
Rd. → Road

You may have some fixed maximum Lehvenstein (N) distance for similar addresses.

If so, you could interrupt the Lehvenstein algorithm if you are sure that the editing distance for the current pair of addresses is greater than N. For this, you need to write a custom version of the Lehvenstein algorithm. This will make the algorithm a little faster.

There are also some related trivial optimizations. For example: if address A is 10 characters long and address B is 20 characters, and you count addresses that have Lechvenshtein's distance less than 8, it will be the same. You can look at the lengths of addresses and immediately decide that they are not alike.

+1

Juha syrjälä Dec 24 '09 at 11:43

source share

If you want to find all such values, you have to compare all the elements with all the others. But choosing the right array functions will greatly speed up the process. Here is a quick example (an array of results could be better):

 $results = array(); $count = count($entries); while ($count != 0) { # The entry to process $entry = array_shift($entries); # Get levenshtein distances to all others $result = array_map( 'levenshtein', # array_map() needs two arrays, this one is an array consisting of # multiple entries of the value that we are processing array_fill($entry, 0, $count), $toCompare ); $results[] = array($entry => $result); $count--; }

+1

soulmerge Dec 24 '09 at 11:48

source share

You speak...

The overall goal is to find similar entries (and eliminate duplicates) based on the addresses sent by the user.

I say ... use the methods http://semaphorecorp.com/mpdd/mpdd.html

+1

joe snyder Aug 08 '10 at 7:13

source share

 $stringA = "this is php programming language"; $stringB = "this is complete programming script in which java php and all other minor languages include"; echo "string 1---->".$stringA."<br />"; echo "string 2---->".$stringB."<br />"; // changing string to arrays $array1 = explode(' ', $stringA); $array2 = explode(' ', $stringB); // getting same element from two strings $c = array_intersect($array1, $array2); // changing array to the string $d=implode(' ',$c); echo "string same elements---> ".$d."<br />"; // getting difrent element from two arrays $result = array_diff($array2, $array1); // changing array to the string $zem = implode(' ',$result); if (!empty($zem)) { echo "string diffrence---> ".$zem."<br />"; } else { echo "string diffrence--->both strings are same <br />"; } similar_text($stringA, $d, $p); echo " similarity between the string is ".$p."% <br />";

+1

Umber singhera Jun 18 '11 at 7:26

source share

symcbean · Accepted Answer · 2009-12-24T12:04:10+0000

I think the best way to group similar addresses would be this:

create a database with two tables - one for the address (and identifier), one for sound expressions of words or letter numbers in the address (with the foreign key of the address table)
enter the address in upper case, replace everything except [AZ] or [0-9] with a space
divide the address by space, calculate the soundex of each word, leave something with numbers as it is and save it in the soundexes table with the foreign key of the address that you started with

for each address (with id $ target) find the most similar addresses:

SELECT similar.id, similar.address, count(*) FROM adress similar, word cmp, word src WHERE src.address_id=$target AND src.soundex=cmp.soundex AND cmp.address_id=similar.id ORDER BY count(*) LIMIT $some_value;

calculate the difference levenstein between your source address and the top few values returned by the query.

(performing any operations on large arrays is often faster in databases)

Compare 5000 Lines with PHP Levenshtein

More articles: