Determining the correct character encoding in PHP?

Question


I am trying to detect character encoding of a string, but I cannot get the correct result.
For instance:

$str = "&euro; &sbquo; &fnof; &bdquo; &hellip;";
$str = mb_convert_encoding($str, 'Windows-1252', 'HTML-ENTITIES');
// Now $str should be a Windows-1252-encoded string.
// Let's detect its encoding:
echo mb_detect_encoding($str, 'Windows-1252, ISO-8859-1, UTF-8');

This code outputs ISO-8859-1, but it should output Windows-1252.

What is going wrong here?

EDIT:
Updated example, in response to @raina77ow.

$str = "&euro;&sbquo;&fnof;&bdquo;&hellip;"; // no whitespace
$str = mb_convert_encoding($str, 'Windows-1252', 'HTML-ENTITIES');
$str = "Hello $str"; // let's add some ASCII characters
echo mb_detect_encoding($str, 'Windows-1252, ISO-8859-1, UTF-8');

I get the wrong result again.


php character-encoding multibyte detection

Getfree Apr 05 '13 at 21:57


2 answers

scy · Answer 1 · 2014-04-23T13:46:37+0000

The problem with Windows-1252 in PHP is that it will almost never be detected: as soon as your text contains any character outside the range 0x80 to 0x9f, it will not be detected as Windows-1252.

This means that if your string contains a normal ASCII letter such as "A", or even a space character, PHP will say that it is not valid Windows-1252 and, in your case, fall back to the next candidate encoding, which is ISO-8859-1. This is a PHP bug, see https://bugs.php.net/bug.php?id=64667 .
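As a possible workaround, the byte range 0x80 to 0x9f can be checked by hand instead of relying on mb_detect_encoding. The sketch below is only a heuristic: guess_encoding is a hypothetical helper (not part of mbstring), and it assumes the input is ASCII, UTF-8, ISO-8859-1, or Windows-1252, and that C1 control codes do not appear in real ISO-8859-1 text.

```php
<?php
// Hypothetical helper, not an mbstring function: guesses among
// ASCII, UTF-8, ISO-8859-1 and Windows-1252 only.
function guess_encoding(string $bytes): string
{
    // Pure 7-bit data is identical in all of the candidate encodings.
    if (!preg_match('/[\x80-\xFF]/', $bytes)) {
        return 'ASCII';
    }
    // Single-byte text rarely happens to form valid UTF-8 sequences.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // Bytes 0x80-0x9F are C1 control codes in ISO-8859-1 but mostly
    // printable characters (e.g. the euro sign at 0x80) in Windows-1252.
    if (preg_match('/[\x80-\x9F]/', $bytes)) {
        return 'Windows-1252';
    }
    return 'ISO-8859-1';
}

echo guess_encoding("Hello \x80\x82\x83\x84\x85"), PHP_EOL; // Windows-1252
```

Because every byte sequence is formally valid in ISO-8859-1, the result is a guess rather than a proof; text without any byte in 0x80 to 0x9f is reported as ISO-8859-1 even if it was produced as Windows-1252, which is harmless since the two encodings agree outside that range.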

rr- · Answer 2 · 2013-04-05T22:01:28+0000

Although the strings encoded as ISO-8859-1 and CP-1252 have different byte representations:

<?php
$str = "&euro; &sbquo; &fnof; &bdquo; &hellip;";
foreach (array('Windows-1252', 'ISO-8859-1') as $encoding) {
    $new = mb_convert_encoding($str, $encoding, 'HTML-ENTITIES');
    printf('%15s: %s detected: %10s explicitly: %10s',
        $encoding,
        implode('', array_map(function($x) { return dechex(ord($x)); }, str_split($new))),
        mb_detect_encoding($new),
        mb_detect_encoding($new, array('ISO-8859-1', 'Windows-1252'))
    );
    echo PHP_EOL;
}

Results:

Windows-1252: 802082208320842085 detected:  explicitly: ISO-8859-1
ISO-8859-1: 3f203f203f203f203f detected: ASCII explicitly: ISO-8859-1

... from what we see here, it looks like the problem lies with the second parameter of mb_detect_encoding. Using mb_detect_order instead of the parameter gives very similar results.
