Deleting invalid / incomplete multibyte characters

Question


I am having some issues using the following code on user input:

htmlentities($string, ENT_COMPAT, 'UTF-8');

When an invalid multibyte character is detected, PHP raises a warning:

PHP Warning: htmlentities(): Invalid multibyte sequence in argument in /path/to/file.php on line 123

My first thought was to suppress the error, but that is slow and poor practice: http://derickrethans.nl/five-reasons-why-the-shutop-operator-should-be-avoided.html

My second thought was to use the ENT_IGNORE flag, but even the PHP manual advises against it:

Silently discard invalid code unit sequences instead of returning an empty string. Using this flag is discouraged as it may have security implications.

A bit more searching led me to the following piece of code:

  // detect encoding
  $encoding = mb_detect_encoding($query);
  if ($encoding != 'UTF-8') {
      $query = mb_convert_encoding($query, 'UTF-8', $encoding);
  } else {
      // strip out invalid utf8 sequences
      $query = iconv('UTF-8', 'UTF-8//IGNORE', $query);
  }

Unfortunately, iconv also raises an E_NOTICE when it removes / ignores invalid characters:

If you append the string //TRANSLIT to out_charset, transliteration is activated. This means that when a character can't be represented in the target charset, it can be approximated through one or several similar-looking characters. If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded. Otherwise, str is cut from the first illegal character and an E_NOTICE is generated.

So I am basically out of options here. I would rather use a tried and tested library to handle this kind of thing than attempt it myself with one of the regular-expression-based solutions I have seen floating around.

So that leads me to my final question: How can I remove invalid multibyte characters efficiently, reliably, without notices / warnings / errors?

+8

php utf-8 iconv

Dean Mar 09 '12 at 8:59


2 answers

iconv('UTF-8', "ISO-8859-1//IGNORE", $string);

worked very well for me. It does not seem to generate any notices.

+4

Nicholas Pickering Mar 14 '13 at 16:44


hakre · Accepted Answer · 2012-03-10T23:52:35+0000

How can I remove invalid multibyte characters efficiently, reliably, without notices / warnings / errors?

Well, as you have already outlined in your own question (or at least linked to), deleting the invalid byte sequence(s) is not an option.

Instead, each invalid sequence should be replaced with the replacement character U+FFFD. Since PHP 5.4.0 you can use the ENT_SUBSTITUTE flag for htmlentities. That is probably the safest choice if you do not want to reject the string.
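A minimal sketch of the ENT_SUBSTITUTE approach, assuming PHP 7 or later for the type declarations (the flag itself needs 5.4.0). The helper name escape_with_substitute and the sample input are made up for illustration:

```php
<?php
// Hypothetical helper, not part of PHP or of the original answer.
function escape_with_substitute(string $input): string
{
    // ENT_SUBSTITUTE replaces each invalid code unit sequence with U+FFFD
    // instead of letting htmlentities() return an empty string.
    return htmlentities($input, ENT_COMPAT | ENT_SUBSTITUTE, 'UTF-8');
}

// "\xC3" opens a two-byte UTF-8 sequence, but "(" is not a continuation
// byte, so this input is not valid UTF-8.
$broken  = "caf\xC3(<b>";
$escaped = escape_with_substitute($broken);

// The result is non-empty, contains U+FFFD where the bad byte was,
// and the markup is still escaped.
var_dump($escaped !== '');
var_dump(strpos($escaped, "\xEF\xBF\xBD") !== false);
var_dump(strpos($escaped, '&lt;b&gt;') !== false);
```

Because the invalid bytes are replaced rather than dropped, the output shows where the input was damaged, which is the reason substitution is preferred over ENT_IGNORE.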

In recent PHP versions iconv will always give you a warning, if it does not drop the whole string altogether. So it does not look like a good alternative for you.
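If the string has to be cleaned before it reaches htmlentities, one alternative to iconv is the mbstring extension. This is a different technique from the one suggested above, sketched here on the assumption that mbstring is installed; the helper name scrub_utf8 is hypothetical. Converting from UTF-8 to UTF-8 makes mbstring replace invalid sequences with its substitute character, which the sketch temporarily sets to U+FFFD:

```php
<?php
// Hypothetical helper; requires the mbstring extension.
function scrub_utf8(string $input): string
{
    // Remember the current substitute character so it can be restored.
    $previous = mb_substitute_character();
    // Use U+FFFD for every invalid sequence.
    mb_substitute_character(0xFFFD);
    // A UTF-8 to UTF-8 conversion rewrites invalid sequences as the substitute.
    $clean = mb_convert_encoding($input, 'UTF-8', 'UTF-8');
    // Restore the previous global setting.
    mb_substitute_character($previous);
    return $clean;
}

$clean = scrub_utf8("caf\xC3(ok");
var_dump(mb_check_encoding($clean, 'UTF-8'));
```

How many bytes are consumed for each invalid sequence can differ between PHP versions, so it is safer to check the result with mb_check_encoding than to rely on an exact output string.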
