Find the most frequently repeated substrings in an array

I have an array:

$myArray=array(

'hello my name is richard',
'hello my name is paul',
'hello my name is simon',
'hello it doesn\'t matter what my name is'

);

I need to find the substrings (at least two words long) that are repeated most often, ideally returned as an array, so the result might look like this:

$return=array(

array('hello my', 3),
array('hello my name', 3),
array('hello my name is', 3),
array('my name', 4),
array('my name is', 4),
array('name is', 4),

);

From this array of arrays I can see how often each phrase is repeated across all the strings in the input array.

Is something like this the only way to do it?

function repeatedSubStrings($array){

    foreach($array as $string){
        $phrases = array(); // placeholder: split each string into every possible sub-phrase
        foreach($phrases as $phrase){
            //Then count the $phrases that are in the strings
        }
    }

}

I tried a solution similar to the one above, but it was too slow: it processed only around 1000 lines per second. Can anyone do this faster?

+5
4 answers

One possible solution is the following, which returns the single most frequent word:

function getHighestRecurrence($strs){

  /*Storage for individual words*/
  $words = Array();

  if (is_array($strs)) {
      /*Process multiple strings*/
      foreach ($strs as $str) {
          $words = array_merge($words, explode(" ", $str));
      }
  } else {
      /*Prepare single string*/
      $words = explode(" ", $strs);
  }

  /*Array for word counters*/
  $index = Array();

  /*Aggregate word counters*/
  foreach ($words as $word) {
      /*Increment count or create if it doesn't exist*/
      if (isset($index[$word])) {
          $index[$word]++;
      } else {
          $index[$word] = 1;
      }
  }


  /*Sort array by highest value, keeping the words as keys*/
  arsort($index);

  /*Return the word*/
  return key($index);
}
+4

An O(n) approach that counts two-word phrases:

$twoWordPhrases = function($str) {
    $words = preg_split('#\s+#', $str, -1, PREG_SPLIT_NO_EMPTY);
    $phrases = array();
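    // A plain for loop yields nothing for strings with fewer than two words,
    // whereas range(0, -1) would count downward and emit one-word "phrases".
    for ($offset = 0; $offset < count($words) - 1; $offset++) {
        $phrases[] = array_slice($words, $offset, 2);
    }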
    return $phrases;
};
$frequencies = array();
foreach ($myArray as $str) {
    $phrases = $twoWordPhrases($str);
    foreach ($phrases as $phrase) {
        $key = join('/', $phrase);
        if (!isset($frequencies[$key])) {
            $frequencies[$key] = 0;
        }
        $frequencies[$key]++;
    }
}
print_r($frequencies);
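
The same counting idea can be extended from two-word phrases to every phrase of two or more words, which is closer to what the question asks for. The following is only a sketch under some assumptions: `countPhrases`, `$minWords` and `$maxWords` are hypothetical names that do not appear in the original answer, and words are assumed to be separated by whitespace. Note that the number of phrases grows quadratically with the number of words per string unless `$maxWords` caps it, so this is no longer strictly O(n).

```php
<?php
// Hypothetical helper (not from the original answer): counts every contiguous
// phrase of $minWords to $maxWords words across all input strings.
function countPhrases(array $strings, $minWords = 2, $maxWords = PHP_INT_MAX)
{
    $frequencies = array();
    foreach ($strings as $str) {
        $words = preg_split('#\s+#', $str, -1, PREG_SPLIT_NO_EMPTY);
        $total = count($words);
        for ($start = 0; $start < $total; $start++) {
            $phrase = '';
            // Grow the phrase one word at a time, starting at $start.
            for ($len = 1; $len <= $maxWords && $start + $len <= $total; $len++) {
                $phrase = ($len === 1)
                    ? $words[$start]
                    : $phrase . ' ' . $words[$start + $len - 1];
                if ($len < $minWords) {
                    continue; // too short to count yet
                }
                if (!isset($frequencies[$phrase])) {
                    $frequencies[$phrase] = 0;
                }
                $frequencies[$phrase]++;
            }
        }
    }
    arsort($frequencies); // highest counts first, phrases kept as keys
    return $frequencies;
}

$myArray = array(
    'hello my name is richard',
    'hello my name is paul',
    'hello my name is simon',
    'hello it doesn\'t matter what my name is',
);

$counts = countPhrases($myArray);
echo $counts['my name'], "\n";  // 4
echo $counts['hello my'], "\n"; // 3
```

Unlike a per-phrase search through all the strings, each phrase is looked up once in a hash map, so the cost is proportional to the number of phrases generated. Whole strings are counted as phrases too; filter them out afterwards if that is not wanted.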
0

Although this is slower, I think it is simpler to implement:

$substrings = array();

foreach ($myArray as $str)
{
    $subArr = explode(" ", $str);
    for ($i=0;$i<count($subArr);$i++)
    {
        $substring = "";
        for ($j=$i;$j<count($subArr);$j++)
        {
            // Skip the phrase that spans the entire string
            if ($i==0 && ($j==count($subArr)-1))
                break;
            $substring = trim($substring . " " . $subArr[$j]);
            if (str_word_count($substring, 0) > 1)
            {
                if (array_key_exists($substring, $substrings))
                    $substrings[$substring]++;
                else
                    $substrings[$substring] = 1;
            }
        }
    }   
}

arsort($substrings);
print_r($substrings);
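
To get from a phrase => count map like `$substrings` to the array-of-arrays format shown in the question, a small conversion step may be enough. This is a sketch only: `toPairs` and `$minCount` are hypothetical names, and the sample counts below are illustrative values in the same shape, not output copied from a real run.

```php
<?php
// Hypothetical helper: turns a phrase => count map into array(phrase, count)
// rows, keeping only phrases that occur at least $minCount times.
function toPairs(array $counts, $minCount = 2)
{
    $pairs = array();
    foreach ($counts as $phrase => $count) {
        if ($count >= $minCount) {
            $pairs[] = array($phrase, $count);
        }
    }
    return $pairs;
}

// Illustrative counts in the shape produced by the loop above.
$substrings = array('my name' => 4, 'hello my' => 3, 'is richard' => 1);
$pairs = toPairs($substrings);
print_r($pairs);
```

Because the input order is preserved, running this after `arsort()` keeps the most frequent phrases first.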
0