What is DOMDocument doing to my string?

Question


$dom = new DOMDocument('1.0', 'UTF-8');
$str = '<p>Hello®</p>';
var_dump(mb_detect_encoding($str));
$dom->loadHTML($str);
var_dump($dom->saveHTML());


Outputs

string(5) "UTF-8"

string(158) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Hello&Acirc;&reg;</p></body></html>
"

Why is my Unicode ® converted to Â®, and how do I stop it?

Am I losing my mind today?

+5

php unicode domdocument

alex Feb 21 '11 at 5:49


3 answers

You can prepend an XML encoding declaration (and strip it from the output later). This works for me on systems that are not based on CentOS 5.x (Ubuntu, cPanel's PHP):

<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$str = '<p>Hello®</p>';
var_dump(mb_detect_encoding($str)); 
$dom->loadHTML('<?xml encoding="utf-8">'.$str);
var_dump($dom->saveHTML());

Where it works, the output is:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8"><html><body><p>Hello&reg;</p></body></html>

Where it does not work (CentOS 5.x in my case), the output is:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8"><html><body><p>Hello&Acirc;&reg;</p></body></html>

+4

Jan 23 '12 at 18:46

I fixed this by decoding the UTF-8 before passing it to loadHTML().

$dom->loadHTML( utf8_decode( $html ) );

saveHTML() seems to convert special characters, such as German umlauts, to their HTML entities. (Even though I set $dom->substituteEntities = false; ... oO)

This is rather strange, given that the documentation states:

The DOM extension uses UTF-8 encoding.

(http://www.php.net/manual/de/class.domdocument.php, search utf8)

Oh dear, PHP and character encodings cause problems over and over ... a never-ending story.

+2

graup Jun 11 '12 at 13:02


Ignacio Vazquez-Abrams · Accepted Answer · 2011-02-21T05:54:44+0000

Your text editor shows "®" because it reads the file as UTF-8, but the same bytes read "Â®" in Latin-1 (or a similar encoding), which is what PHP uses to read them. Using a character entity reference removes this ambiguity.

>>> print('®'.encode('utf-8').decode('latin-1'))
Â®
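
For what it's worth, here is a minimal PHP sketch of the entity approach. It assumes the mbstring extension is loaded (the question already calls mb_detect_encoding) and PHP 7 or later. toNumericEntities() is a hypothetical helper name made up for this example, not something from the answer above, and numeric entities are only one way to write a character entity reference.

```php
<?php
// Sketch only: escape every non-ASCII character as a numeric entity so that
// loadHTML() never has to guess the encoding of the input bytes.
// toNumericEntities() is a hypothetical helper, not a built-in PHP function.
function toNumericEntities(string $utf8): string
{
    // Map entries are [first code point, last code point, offset, mask]:
    // cover U+0080..U+10FFFF and leave plain ASCII untouched.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// "\u{00AE}" is the registered sign, written as an escape so the example
// does not depend on how this source file itself is saved.
$str = "<p>Hello\u{00AE}</p>";

$escaped = toNumericEntities($str); // <p>Hello&#174;</p>

$dom = new DOMDocument();
$dom->loadHTML($escaped);

// DOM strings are UTF-8, so the text node holds the real character again.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === "Hello\u{00AE}");
```

Because the string handed to loadHTML() is pure ASCII, it reads the same whether the parser assumes Latin-1 or UTF-8, which is the ambiguity the answer describes.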
