Bag-of-Words Model: 2 PHP Functions, Same Results: Why?

I have two PHP functions for calculating how closely two texts are related. Both use the bag-of-words model, but check2() is much faster. Even so, both functions return the same results. Why? check1() uses one large dictionary array containing ALL words, as described in the bag-of-words model. check2() does not use one large array, but an array that contains only the words of one text. So check2() should not work, but it does. Why do both functions give the same results?

    function check1($terms_in_article1, $terms_in_article2) {
        global $zeit_check1;
        $zeit_s = microtime(TRUE);
        $length1 = count($terms_in_article1); // number of words
        $length2 = count($terms_in_article2); // number of words
        $all_terms = array_merge($terms_in_article1, $terms_in_article2);
        $all_terms = array_unique($all_terms);
        foreach ($all_terms as $all_termsa) {
            $term_vector1[$all_termsa] = 0;
            $term_vector2[$all_termsa] = 0;
        }
        foreach ($terms_in_article1 as $terms_in_article1a) {
            $term_vector1[$terms_in_article1a]++;
        }
        foreach ($terms_in_article2 as $terms_in_article2a) {
            $term_vector2[$terms_in_article2a]++;
        }
        $score = 0;
        foreach ($all_terms as $all_termsa) {
            $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
        }
        $score = $score/($length1*$length2);
        $score *= 500; // for better readability
        $zeit_e = microtime(TRUE);
        $zeit_check1 += ($zeit_e-$zeit_s);
        return $score;
    }

    function check2($terms_in_article1, $terms_in_article2) {
        global $zeit_check2;
        $zeit_s = microtime(TRUE);
        $length1 = count($terms_in_article1); // number of words
        $length2 = count($terms_in_article2); // number of words
        $score_table = array();
        foreach($terms_in_article1 as $term){
            if(!isset($score_table[$term])) $score_table[$term] = 0;
            $score_table[$term] += 1;
        }
        $score_table2 = array();
        foreach($terms_in_article2 as $term){
            if(isset($score_table[$term])){
                if(!isset($score_table2[$term])) $score_table2[$term] = 0;
                $score_table2[$term] += 1;
            }
        }
        $score = 0;
        foreach($score_table2 as $key => $entry){
            $score += $score_table[$key] * $entry;
        }
        $score = $score/($length1*$length2);
        $score *= 500;
        $zeit_e = microtime(TRUE);
        $zeit_check2 += ($zeit_e-$zeit_s);
        return $score;
    }

I hope you can help me. Thanks in advance!

2 answers

Both functions implement almost the same algorithm, but while the first does it in a simple way, the second is a little smarter and skips some unnecessary work.

check1 looks something like this:

    // loop length(words1) times
    for each word in words1:
        freq1[word]++

    // loop length(words2) times
    for each word in words2:
        freq2[word]++

    // loop length(union(words1, words2)) times
    for each word in union(words1, words2):
        score += freq1[word] * freq2[word]

But remember: when you multiply something by zero, you get zero.

This means that counting the frequencies of words that do not occur in both texts is a waste of time: one of the two frequencies is zero, so the product adds nothing to the score.

check2 takes this into account:

    // loop length(words1) times
    for each word in words1:
        freq1[word]++

    // loop length(words2) times
    for each word in words2:
        if freq1[word] > 0:
            freq2[word]++

    // loop length(intersection(words1, words2)) times
    for each word in freq2:
        score += freq1[word] * freq2[word]
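To make the equivalence concrete, here is a minimal sketch that computes the raw sum both ways: once over every word in either text (as check1 does) and once over only the shared words (as check2 does). The sample word lists and the variable names are made up for illustration and are not taken from the question.

```php
<?php
// Made-up sample input, not taken from the question.
$terms1 = ['a', 'b', 'a', 'c'];
$terms2 = ['a', 'd', 'b', 'b'];

$freq1 = array_count_values($terms1); // ['a' => 2, 'b' => 1, 'c' => 1]
$freq2 = array_count_values($terms2); // ['a' => 1, 'd' => 1, 'b' => 2]

// check1 style: iterate over every word that occurs in either text.
$unionScore = 0;
foreach (array_keys($freq1 + $freq2) as $word) {
    // A word missing from one text has frequency 0 there.
    $unionScore += ($freq1[$word] ?? 0) * ($freq2[$word] ?? 0);
}

// check2 style: iterate only over words that occur in both texts.
$intersectionScore = 0;
foreach (array_intersect_key($freq1, $freq2) as $word => $count1) {
    $intersectionScore += $count1 * $freq2[$word];
}

var_dump($unionScore === $intersectionScore); // bool(true)
```

The words 'c' and 'd' each appear in only one text, so their products are zero and dropping them changes nothing.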

Since you seem to be concerned about performance, here is an optimized version of the algorithm in your check2 function. It relies more on built-in functions to increase speed.

    function check ($terms1, $terms2) {
        $counts1 = array_count_values($terms1);
        $totalScore = 0;
        foreach ($terms2 as $term) {
            if (isset($counts1[$term])) $totalScore += $counts1[$term];
        }
        return $totalScore * 500 / (count($terms1) * count($terms2));
    }
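The single loop is enough because every occurrence of a term in the second text adds that term's frequency in the first text, so a term that occurs n times in the second text contributes n times its first-text frequency in total. Below is a rough sketch of that accumulation. The helper name raw_overlap and the sample input are hypothetical and only serve the illustration.

```php
<?php
// Hypothetical helper: same accumulation as check() above, without the final scaling.
function raw_overlap(array $terms1, array $terms2): int
{
    $counts1 = array_count_values($terms1);
    $total = 0;
    foreach ($terms2 as $term) {
        // Each occurrence in $terms2 adds the term's frequency in $terms1
        // (0 if the term never occurs there).
        $total += $counts1[$term] ?? 0;
    }
    return $total;
}

// Made-up sample input.
$terms1 = ['a', 'b', 'a', 'c'];
$terms2 = ['a', 'd', 'b', 'b'];

$raw = raw_overlap($terms1, $terms2);                     // 2 + 0 + 1 + 1 = 4
$score = $raw * 500 / (count($terms1) * count($terms2)); // 4 * 500 / 16 = 125
```

For this input the raw sum matches what the two-table approach produces: 'a' contributes 2 * 1 and 'b' contributes 1 * 2.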
