Complex string comparisons

I am trying to write a function in PHP that takes an array of strings ( needle ) and does a comparison with another array of strings ( haystack ). The purpose of this function is to quickly deliver matching strings for AJAX search, so it should be as fast as possible.

Here is a sample code to illustrate two arrays:

 $needle = array('ba','hot','resta'); $haystack = array( 'Southern Hotel', 'Grange Restaurant & Hotel', 'Austral Hotel', 'Barsmith Hotel', 'Errestas' ); 

While this is pretty easy in itself, the purpose of the comparison is to count the number of needle strings in a haystack .

However, there are three limitations:

  • Case insensitive comparison
  • needle should match only the characters at the beginning of the word. For example, “Hote” will match “Hotel”, but “resta” will not match “Errestas”.
  • We want to count the number of matches needles , not the number of times needle appears. If the place is called "Hotel Hotel Hotel", we need a result of 1 not 3 .

Using the above example, we expect the following associative array:

 $haystack = array( 'Southern Hotel' => 1, 'Grange Restaurant & Hotel' => 2, 'Austral Hotel' => 1, 'Barsmith Hotel' => 2, 'Erresta' => 0 ); 

I am trying to implement a function for this using preg_match_all() and a regular expression that looks like /(\A|\s)(ba|hot|resta)/ . Although this ensures that we only match the beginning of words, it does not account for lines that contain the same needle twice.

I am sending messages to find out if anyone has a solution?

+4
source share
4 answers

I found that you are describing the problem in sufficient detail so that I can use TDD to solve it. Therefore, since I try so hard to be a TDD guy, I wrote tests and a function to pass the tests. Namings may not be perfect, but they change easily. The function algorithm may also not be the best, but now that there are tests, refactoring should be very simple and painless.

Here are the tests:

 class MultiMatcherTest extends PHPUnit_Framework_TestCase { public function testTheComparisonIsCaseInsensitive() { $needles = array('hot'); $haystack = array('Southern Hotel'); $result = match($needles, $haystack); $this->assertEquals(array('Southern Hotel' => 1), $result); } public function testNeedleMatchesOnlyCharsAtBeginningOfWord() { $needles = array('resta'); $haystack = array('Errestas'); $result = match($needles, $haystack); $this->assertEquals(array('Errestas' => 0), $result); } public function testMatcherCountsNeedlesNotOccurences() { $needles = array('hot'); $haystack = array('Southern Hotel', 'Grange Restaurant & Hotel'); $expected = array('Southern Hotel' => 1, 'Grange Restaurant & Hotel' => 1); $result = match($needles, $haystack); $this->assertEquals($expected, $result); } public function testAcceptance() { $needles = array('ba','hot','resta'); $haystack = array( 'Southern Hotel', 'Grange Restaurant & Hotel', 'Austral Hotel', 'Barsmith Hotel', 'Errestas', ); $expected = array( 'Southern Hotel' => 1, 'Grange Restaurant & Hotel' => 2, 'Austral Hotel' => 1, 'Barsmith Hotel' => 2, 'Errestas' => 0, ); $result = match($needles, $haystack); $this->assertEquals($expected, $result); } } 


And here is the function:

 function match($needles, $haystack) { // The default result will containg 0 (zero) occurences for all $haystacks $result = array_combine($haystack, array_fill(0, count($haystack), 0)); foreach ($needles as $needle) { foreach ($haystack as $subject) { $words = str_word_count($subject, 1); // split into words foreach ($words as $word) { if (stripos($word, $needle) === 0) { $result[$subject]++; break; } } } } return $result; } 


Verify that a break statement is required

The following test shows when break is required. Run this test with or without a break statement inside the match function.

 /** * This test demonstrates the purpose of the BREAK statement in the * implementation function. Without it, the needle will be matched twice. * "hot" will be matched for each "Hotel" word. */ public function testMatcherCountsNeedlesNotOccurences2() { $needles = array('hot'); $haystack = array('Southern Hotel Hotel'); $expected = array('Southern Hotel Hotel' => 1); $result = match($needles, $haystack); $this->assertEquals($expected, $result); } 
+7
source

Array and string functions are usually faster in magnitude than regular expressions. It should be pretty easy to do what you want with a combination of array_filter and substr_count .

Greetings

+2
source

@Ionut G. Stan wow, what an answer!

@ Lachlan MacDonald If you have speed problems (try first, not just guess :)), you can use the needle to match the beginning of the line: breaking the haystack during the assembly process by the first letter and repeating only the hay array corresponding to the first char needles.

This will make less than 1/10 of the comparison per needle.

+1
source

You can try:

 $results=Array(); foreach ($haystack as $stack) { $counter=0; $lcstack=strtolower($stack); foreach ($needle as $need) { if (substr($lcstack,0,strlen($need))==strtolower($need)) { $counter++; } } $results[$stack]=$counter; } 
0
source

All Articles