PHP Stop Word List

I'm playing with stop words in my code. I have an array full of words that I would like to test, and an array of words I want to test against.

At the moment, I iterate over the array one word at a time and remove the word if in_array finds it in the list of stop words, but I wonder if there is a better way to do this. I looked at array_diff and similar functions; however, if a stop word appears several times in the first array, array_diff seems to remove only the first occurrence.

The focus is on speed and memory usage, but speed matters more.

Edit -

The first array contains the individual words from blog comments (these are usually quite long); the second array is the list of stop words. Sorry for not making this clear.

thanks

+4
performance arrays php words
4 answers

Using str_replace ...

A simple approach is to use str_replace or str_ireplace, which can take an array of "needles" (things to look for), the corresponding replacements, and an array of "haystacks" (things to search in).

    $haystacks = array(
        "The quick brown fox",
        "jumps over the ",
        "lazy dog"
    );
    $needles = array("the", "lazy", "quick");
    $result = str_ireplace($needles, "", $haystacks);
    var_dump($result);

This produces:

    array(3) {
      [0]=>
      string(11) "  brown fox"
      [1]=>
      string(12) "jumps over  "
      [2]=>
      string(4) " dog"
    }

As an aside, a quick way to clear the leading and trailing spaces that this leaves is to use array_map to call trim on each element:

 $result=array_map("trim", $result); 
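Note that trim only touches the ends of each string; a word removed from the middle of a sentence still leaves two adjacent spaces behind. A minimal sketch of one way to tidy that up, assuming it is acceptable to collapse every run of whitespace into a single space (the helper name tidy_spaces is made up for this illustration):

```php
<?php
// Hypothetical helper: collapse each run of whitespace into one space,
// then strip whatever is left at the start and end of the string.
function tidy_spaces($text)
{
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Strings shaped like the ones a stop word removal leaves behind.
$result = array_map('tidy_spaces', array("  brown fox", "jumps over  ", "a  b"));
// $result is now array("brown fox", "jumps over", "a b")
```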

The disadvantage of using str_replace is that it also replaces matches found inside words, not just whole words. To solve this problem, we can use regular expressions ...

Using preg_replace ...

The preg_replace approach looks very similar to the above, but the needles are regular expressions, and we match a word boundary at the start and end of each pattern using \b:

    $haystacks = array(
        "For we shall use fortran to",
        "fortify the general theme",
        "of this torrent of nonsense"
    );
    $needles = array(
        '/\bfor\b/i',
        '/\bthe\b/i',
        '/\bto\b/i',
        '/\bof\b/i'
    );
    $result = preg_replace($needles, "", $haystacks);
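With a long stop word list, writing one pattern per word gets tedious, and each pattern is applied in a separate pass. A possible variation, sketched here under the assumption that the stop words arrive as a plain array of strings, is to build a single alternation pattern from the list. The function name remove_stop_words is invented for this example; preg_quote and implode are standard PHP functions.

```php
<?php
// Hypothetical helper: strip whole-word stop words from each string in $texts.
function remove_stop_words(array $texts, array $stopWords)
{
    // Escape each word so any regex metacharacters in it are matched literally.
    $escaped = array();
    foreach ($stopWords as $word) {
        $escaped[] = preg_quote($word, '/');
    }
    // Build one case-insensitive alternation, e.g. /\b(?:for|the|to|of)\b/i
    $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/i';
    return preg_replace($pattern, '', $texts);
}

$haystacks = array(
    "For we shall use fortran to",
    "fortify the general theme",
    "of this torrent of nonsense"
);
$cleaned = remove_stop_words($haystacks, array('for', 'the', 'to', 'of'));
// "fortran", "fortify", "theme" and "torrent" are left alone,
// because \b only matches at the edges of whole words.
```

Whether this is actually faster than the array of patterns depends on the data, so it is worth timing both on real comments.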
+8

If you already have two sorted arrays, you can use this algorithm to remove every element of array A that also occurs in array B (in set notation: A \ B):

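    for ($i = 0, $n = count($a), $j = 0, $m = count($b); $i < $n && $j < $m; ) {
        $diff = strcmp($a[$i], $b[$j]);
        if ($diff == 0) {
            // $a[$i] is a stop word: remove it, but keep $j in case it repeats in $a
            unset($a[$i]);
            $i++;
        } elseif ($diff < 0) {
            // $a[$i] sorts before $b[$j], so it cannot be in $b
            $i++;
        } else {
            // $b[$j] sorts before $a[$i]: move on to the next stop word
            $j++;
        }
    }

This takes only O(n) steps, since each array is walked once.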

Another approach is to use the words of array B as the keys of an index (using array_flip), iterate over the values of A, and check whether each one is a key in the index using array_key_exists:

    $index = array_flip($b);
    foreach ($a as $key => $val) {
        // Look the word up in the flipped index, not in $b itself
        if (array_key_exists($val, $index)) {
            unset($a[$key]);
        }
    }

Again, this is O(n), since it avoids searching all of B for every value in A, which would be O(n^2).
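The index idea can also be wrapped in a small function that returns a fresh, reindexed array instead of unsetting entries in place. This is only a sketch: the name filter_stop_words is hypothetical, and lowercasing each word before the lookup is an added assumption so that "The" is treated like "the".

```php
<?php
// Hypothetical helper: return the words of $words that are not in $stopWords.
function filter_stop_words(array $words, array $stopWords)
{
    // Flip once so each stop word becomes a key; every lookup is then a hash lookup.
    $index = array_flip($stopWords);
    $kept = array();
    foreach ($words as $word) {
        // Assumption: stop words are stored in lowercase, so compare in lowercase.
        if (!isset($index[strtolower($word)])) {
            $kept[] = $word;
        }
    }
    return $kept;
}

$words = explode(' ', 'The quick brown fox jumps the fence and runs');
$kept = filter_stop_words($words, array('the', 'and', 'an', 'of'));
// $kept is array('quick', 'brown', 'fox', 'jumps', 'fence', 'runs')
```

Every occurrence of a stop word is dropped, not just the first, and the flipped index costs one extra array the size of the stop word list.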

+1

array_diff() should work.

    $sentence = "the quick brown fox jumps the fence and runs";
    $array = explode(" ", $sentence);
    $stopwords = array("the", "and", "an", "of");
    print_r(array_diff($array, $stopwords));

Result

    Array
    (
        [1] => quick
        [2] => brown
        [3] => fox
        [4] => jumps
        [6] => fence
        [8] => runs
    )

I tested on this site: http://sandbox.onlinephpfunctions.com/
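As the gaps in the keys above show, array_diff removes every occurrence of a stop word, not just the first, but it keeps the original keys. A short sketch with repeated stop words, using array_values to reindex the result (the sample sentence is made up for illustration):

```php
<?php
$words = explode(' ', 'the cat and the dog and the bird');
$stopwords = array('the', 'and');

// array_diff drops all matching values but preserves keys,
// so array_values is used to get a clean 0..n-1 list back.
$filtered = array_values(array_diff($words, $stopwords));
// $filtered is array('cat', 'dog', 'bird')
```

Keep in mind that array_diff compares values as strings and is case-sensitive, so "The" would survive unless the words are lowercased first.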

+1

How about using in_array?

http://au.php.net/manual/en/function.in-array.php

The function takes a needle (the value to look for) and a haystack, which is an array.

bool in_array(mixed $needle, array $haystack [, bool $strict])

Alternatively, you can loop over the stop words one by one and find all matches.

-1