I am having some problems using the following code on user input:
htmlentities($string, ENT_COMPAT, 'UTF-8');
When an invalid multibyte character is detected, PHP issues a warning:
PHP Warning: htmlentities(): Invalid multibyte sequence in argument in /path/to/file.php on line 123
My first thought was to suppress the error, but this is slow and poor practice: http://derickrethans.nl/five-reasons-why-the-shutop-operator-should-be-avoided.html
My second thought was to use the ENT_IGNORE flag, but even the PHP manual advises against it:
"Silently discard invalid code unit sequences instead of returning an empty string. Using this flag is discouraged as it may have security implications."
A bit of further reasoning led me to the following piece of code:
    // detect encoding
    $encoding = mb_detect_encoding($query);
    if ($encoding != 'UTF-8') {
        $query = mb_convert_encoding($query, 'UTF-8', $encoding);
    } else {
        // strip out invalid utf8 sequences
        $query = iconv('UTF-8', 'UTF-8//IGNORE', $query);
    }
Unfortunately, iconv also raises an E_NOTICE when it removes/ignores invalid characters. From the manual:
"If you append the string //TRANSLIT to out_charset, transliteration is activated. This means that when a character can't be represented in the target charset, it can be approximated through one or several similarly looking characters. If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded. Otherwise, str is cut from the first illegal character and an E_NOTICE is generated."
So I am basically out of options. I would rather use a tried and tested library to handle this kind of thing than attempt it with one of the regular-expression-based solutions I've seen floating around.
That leads me to my final question: How can I remove invalid multibyte characters efficiently and reliably, without notices / warnings / errors?
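To make the goal concrete, here is a minimal sketch of the kind of behaviour I am after. It is not a confirmed solution. It assumes the mbstring extension is available, and the helper name strip_invalid_utf8 is purely hypothetical (my own invention, not part of PHP). The idea is to set mbstring's substitute character to 'none' so that a UTF-8 to UTF-8 conversion drops invalid sequences, then hand the result to htmlentities():

```php
<?php
// Hypothetical helper (the name is an assumption, not a PHP built-in):
// drops byte sequences that are not valid UTF-8.
// Assumes the mbstring extension is loaded.
function strip_invalid_utf8($string)
{
    // Already valid: return it unchanged and skip the conversion.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }

    // Remember the current substitute character so it can be restored.
    $previous = mb_substitute_character();

    // 'none' asks mbstring to drop invalid sequences instead of replacing them.
    mb_substitute_character('none');
    $clean = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

    // Restore the previous setting so other code is not affected.
    mb_substitute_character($previous);

    return $clean;
}

// "\xFF" can never appear in well-formed UTF-8.
$input = "caf\xC3\xA9 \xFF & more";
$safe  = htmlentities(strip_invalid_utf8($input), ENT_COMPAT, 'UTF-8');
```

Whether silently dropping bytes like this is acceptable from a security standpoint is exactly the part I am unsure about, so treat the above as an illustration of the question rather than an answer.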