Forewarning
There are (as has been noted repeatedly in the comments) gaping holes for you, and/or your code, to fall into when implementing such a feature, to name a few:
- People will add characters to trick the filter.
- People will become creative (e.g. innuendo)
- People will use passive aggression and sarcasm.
- People will use whole sentences / phrases, not just single words.
You would do better to implement a flagging system, where people can flag offensive comments, which can then be edited / deleted by mods, users, etc.
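As a rough illustration of the flagging idea, here is a minimal sketch. The function name `flag_comment`, the in-memory `$flags` array, and the threshold of three distinct users are all assumptions made for this example; a real system would keep the flags in a database and hand flagged comments to moderators.

```php
<?php
// Hypothetical in-memory flag store: comment id => set of user ids who flagged it.
function flag_comment(array &$flags, int $comment_id, int $user_id, int $threshold = 3): bool
{
    // Keying by user id records each user at most once per comment.
    $flags[$comment_id][$user_id] = true;

    // True once enough distinct users have flagged the comment.
    return count($flags[$comment_id]) >= $threshold;
}

$flags = array();
flag_comment($flags, 42, 1);
flag_comment($flags, 42, 1); // a repeat flag from the same user changes nothing
flag_comment($flags, 42, 2);
$needs_review = flag_comment($flags, 42, 3); // third distinct user reaches the threshold
var_dump($needs_review); // bool(true)
```

Counting distinct users rather than raw flags keeps a single annoyed user from pushing a comment into the moderation queue alone.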
With that understanding, let's continue ...
Solution
Given that you:
- Have a list of bad words: $bad_words
- Have a list of replacement words: $good_words
- Want to replace bad words regardless of case
- Want to replace bad words with random good words
- Have a correctly escaped list of bad words: see http://php.net/preg_quote
You can easily use the PHP function preg_replace_callback :
    $input_string = 'This Could be interesting but should it be? Perhaps this \'would\' work; or couldn\'t it?';
    $bad_words = array('could', 'would', 'should');
    $good_words = array('might', 'will');

    function replace_words($matches){
        global $good_words;
        return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
    }

    echo preg_replace_callback('/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i', 'replace_words', $input_string);
The code builds a single regex pattern containing all the bad words and passes it to preg_replace_callback, which calls replace_words for every match. Matches will be in the format:
/(START OR WORD_BOUNDARY OR WHITE_SPACE)(BAD_WORD)(WORD_BOUNDARY OR WHITE_SPACE OR END)/i
The i modifier makes the pattern case-insensitive, so both bad and Bad match.
The replace_words function takes the matched word and its boundaries (either empty or white space), and returns the boundaries with a random good word between them.
    global $good_words;                           <-- Makes the $good_words variable accessible from within the function
    $matches[1]                                   <-- The word boundary before the matched word
    $matches[3]                                   <-- The word boundary after the matched word
    $good_words[rand(0, count($good_words)-1)]    <-- Selects a random good word from $good_words
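To see what ends up in $matches, here is a small sketch that runs the same pattern through preg_match on a shortened input string (the shortened string is just for illustration):

```php
<?php
$pattern = '/(^|\b|\s)(could|would|should)(\b|\s|$)/i';

preg_match($pattern, 'This Could be interesting', $matches);

var_dump($matches[1]); // string(1) " "     the white space before the word
var_dump($matches[2]); // string(5) "Could" the bad word as written in the input
var_dump($matches[3]); // string(0) ""      \b matched a zero-width boundary
```

Because the first and third groups are handed back unchanged by the callback, any white space they captured survives the replacement.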
Anonymous function
You can rewrite the above as a single statement by passing an anonymous function to preg_replace_callback:
    echo preg_replace_callback(
        '/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i',
        function ($matches) use ($good_words){
            return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
        },
        $input_string
    );
Function wrapper
If you intend to use it several times, you can also write it as a self-contained function, although in this case you will most likely want to pass the good / bad words in when you call it (or hard-code them inside the function), but that depends on how you obtain them ...
    function clean_string($input_string, $bad_words, $good_words){
        return preg_replace_callback(
            '/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i',
            function ($matches) use ($good_words){
                return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
            },
            $input_string
        );
    }

    echo clean_string($input_string, $bad_words, $good_words);
Output
Running the above functions in sequence with the input string and word lists shown in the first example:
    This will be interesting but might it be? Perhaps this 'will' work; or couldn't it?
    This might be interesting but might it be? Perhaps this 'might' work; or couldn't it?
    This might be interesting but will it be? Perhaps this 'will' work; or couldn't it?
Of course, the replacement words are chosen randomly, so if I refreshed the page I would get something else ... But this shows what does / doesn't get replaced.
NB
Escaping $bad_words
    foreach($bad_words as $key=>$word){
        // Pass the pattern delimiter so that "/" inside a word is escaped as well
        $bad_words[$key] = preg_quote($word, '/');
    }
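If you would rather not rely on callers escaping the list beforehand, the escaping can be folded into the wrapper. The following is only a sketch: the name `clean_string_escaped` is made up for this example, it swaps rand() for array_rand(), and it adds a guard for empty lists, since an empty $bad_words would otherwise produce an empty group that matches at every word boundary.

```php
<?php
// Hypothetical variant of clean_string() that escapes the bad words itself.
function clean_string_escaped($input_string, array $bad_words, array $good_words)
{
    // Nothing to filter, or nothing to replace with: return the input untouched.
    if (!$bad_words || !$good_words) {
        return $input_string;
    }

    // Escape regex metacharacters and the "/" delimiter in every word.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $bad_words);

    $pattern = '/(^|\b|\s)(' . implode('|', $escaped) . ')(\b|\s|$)/i';

    return preg_replace_callback($pattern, function ($matches) use ($good_words) {
        // array_rand() returns a random key of $good_words.
        return $matches[1] . $good_words[array_rand($good_words)] . $matches[3];
    }, $input_string);
}

echo clean_string_escaped('you $h1t!', array('$h1t'), array('fool')), "\n"; // you fool!
```

With a single replacement word the output is deterministic, which makes this form convenient to test.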
Word Boundaries \b
In this code, I used \b , \s , and ^ or $ as word boundaries, and there is a good reason for this. While white space , start of string and end of string are all considered word boundaries, \b will not match in all cases, for example:
\b\$h1t\b <---Will not match
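This is because \b only matches at a position between a word character (ie [a-zA-Z0-9_] ) and a non-word character, and characters like $ are not considered word characters. When the word is preceded by white space, there is no such position in front of the $ .
Miscellaneous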
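A quick way to check this behaviour yourself is the sketch below; the sample text is made up for the demonstration.

```php
<?php
$text = 'you $h1t';

// \b alone: there is no boundary between the space and "$", so nothing matches.
$plain = preg_match('/\b\$h1t\b/', $text);

// With \s as an alternative, the leading space satisfies the first group.
$extended = preg_match('/(^|\b|\s)\$h1t(\b|\s|$)/', $text);

var_dump($plain);    // int(0)
var_dump($extended); // int(1)
```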
Depending on the size of your word list, there are a couple of potential hiccups. From a system design perspective, it is generally bad form to have huge regexes, for a few reasons:
- Hard to maintain
- Hard to read / understand what it does
- Hard to find errors
- Can be memory intensive if the list is too long
Given that the regex pattern is assembled by PHP from the word list, the first reason is negated. The second should be negated as well; if your word list is large, with a dozen permutations of each bad word, then I suggest you stop and rethink your approach (read: use a flagging / moderation system).
To clarify, I don't see a problem with a small word list that filters out specific expletives, as it serves a purpose: stopping users from having an outburst at each other. The problem comes when you try to filter out too much, including permutations. Stick to filtering common swear words, and if that doesn't work then, for the last time, implement a flagging / moderation system.