DOMDocument encoding problems / garbled characters

I am using DOMDocument to manipulate / modify HTML before it is output to the page. This is only an HTML fragment, not a complete page. My initial problem was that all the French characters got messed up, which I was able to fix after some trial and error. Now only one problem seems to remain: the ’ character (typographic apostrophe) gets transformed into ?.

The code:

    <?php
    $dom = new DOMDocument('1.0', 'utf-8');
    $dom->loadHTML(utf8_decode($row->text));

    // Some pretty basic modification here, not even related to text

    // Reinsert HTML, and make sure to remove the DOCTYPE, html and body that get added automatically
    $row->text = utf8_encode(preg_replace(
        '/^<!DOCTYPE.+?>/',
        '',
        str_replace(
            array('<html>', '</html>', '<body>', '</body>'),
            array('', '', '', ''),
            $dom->saveHTML()
        )
    ));
    ?>

I know it's getting messy with all the utf8 decoding / encoding, but this is the only way I could get it working so far. Here is a sample string:

Input: Sans doute parce qu’il vient d’atteindre une date déterminante dans son spectaculaire cheminement

Output: Sans doute parce qu?il vient d?atteindre une date d&eacute;terminante dans son spectaculaire cheminement

If I find more details, I will add them. Thank you for your time and support!

+6
php utf-8 domdocument
3 answers

Do not use utf8_decode. If your text is in UTF-8, pass it as such.
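To illustrate where the ? in the question comes from: utf8_decode converts UTF-8 to ISO-8859-1, and a character that has no ISO-8859-1 equivalent is replaced by ?. The sketch below reproduces that round trip with mb_convert_encoding as a stand-in (utf8_decode and utf8_encode are deprecated as of PHP 8.2). It assumes the mbstring extension is available, the default substitution character is in effect, and the apostrophe in the source text is the typographic ’ (U+2019).

```php
<?php
// Assumption: the source text is UTF-8 and uses the typographic apostrophe U+2019.
$text = 'qu’il vient d’atteindre une date déterminante';

// The same conversion utf8_decode() performs: UTF-8 -> ISO-8859-1.
// U+2019 does not exist in ISO-8859-1, so it is replaced by "?"; "é" survives.
$latin1 = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

// Converting back, as utf8_encode() would, cannot restore the lost character.
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo $roundTrip, PHP_EOL; // prints: qu?il vient d?atteindre une date déterminante
```

The accented letters come back intact because they exist in ISO-8859-1, which is why only the apostrophe appears broken.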

Unfortunately, DOMDocument defaults to LATIN1 in the case of HTML. The behavior seems to be:

  • If fetching a remote document, it should deduce the encoding from the headers
  • If the header was not sent or the file is local, look for the corresponding meta http-equiv
  • Otherwise, default to LATIN1.

Working example:

    <?php
    $s = <<<HTML
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    </head>
    <body>
    Sans doute parce qu'il vient d'atteindre une date déterminante dans son spectaculaire cheminement
    </body>
    </html>
    HTML;

    libxml_use_internal_errors(true);
    $d = new domdocument;
    $d->loadHTML($s);
    echo $d->textContent;

And with XML (default is UTF-8):

    <?php
    // Double quotes are used so the apostrophes do not terminate the string
    $s = "<x>Sans doute parce qu'il vient d'atteindre une date déterminante " .
        "dans son spectaculaire cheminement</x>";

    libxml_use_internal_errors(true);
    $d = new domdocument;
    $d->loadXML($s);
    echo $d->textContent;
+16

loadHtml() does not always recognize the correct encoding specified in the Content-type HTTP-EQUIV meta tag.

If the DomDocument('1.0', 'UTF-8') and loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . $html) hacks do not work, as they did not for me (PHP 5.3.13), try the following:

Insert another <head> section immediately after the opening <html> tag, containing the correct Content-Type HTTP-EQUIV meta tag. Then call loadHtml(), then remove the extra <head>.

    // Ensure the entire page is encoded in UTF-8
    $encoding = mb_detect_encoding($body);
    $body = $encoding ? @iconv($encoding, 'UTF-8', $body) : $body;

    // Insert a head and meta tag immediately after the opening <html> to force UTF-8 encoding
    $insertPoint = false;
    if (preg_match("/<html.*?>/is", $body, $matches, PREG_OFFSET_CAPTURE)) {
        // PREG_OFFSET_CAPTURE reports byte offsets, so use strlen/substr rather than the mb_* variants
        $insertPoint = strlen($matches[0][0]) + $matches[0][1];
    }
    if ($insertPoint) {
        $body = substr($body, 0, $insertPoint)
            . "<head><meta http-equiv='Content-type' content='text/html; charset=UTF-8' /></head>"
            . substr($body, $insertPoint);
    }

    $dom = new DOMDocument();

    // Suppress warnings when loading non-standard HTML pages
    libxml_use_internal_errors(true);
    $dom->loadHTML($body);
    libxml_use_internal_errors(false);

    // Now remove extra <head>
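The final "remove extra <head>" step is left open above. One possible way to finish it is sketched below, under the assumption that the input is already UTF-8. The helper name loadPageAsUtf8 is hypothetical. Rather than removing the first <head> outright, it removes only the injected meta element and drops its parent <head> if that leaves it empty, since how the parser treats a second <head> may vary.

```php
<?php
// Hypothetical helper; the name and structure are illustrative only.
function loadPageAsUtf8(string $body): DOMDocument
{
    $extraHead = '<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>';

    // preg_match reports byte offsets, so the byte-based strlen/substr are used.
    if (preg_match('/<html.*?>/is', $body, $m, PREG_OFFSET_CAPTURE)) {
        $insertPoint = $m[0][1] + strlen($m[0][0]);
        $body = substr($body, 0, $insertPoint) . $extraHead . substr($body, $insertPoint);
    }

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($body);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove the injected meta (the first Content-Type meta in document order).
    foreach ($dom->getElementsByTagName('meta') as $node) {
        if (strcasecmp($node->getAttribute('http-equiv'), 'Content-Type') === 0) {
            $head = $node->parentNode;
            $head->removeChild($node);
            // Drop the <head> too if the meta was its only child.
            if (!$head->hasChildNodes() && $head->parentNode !== null) {
                $head->parentNode->removeChild($head);
            }
            break;
        }
    }

    return $dom;
}

$page = '<html><head><title>Test</title></head><body><p>date déterminante</p></body></html>';
$dom = loadPageAsUtf8($page);
echo $dom->getElementsByTagName('p')->item(0)->textContent, PHP_EOL; // prints: date déterminante
```

If the page already carries its own Content-Type meta tag, this sketch would remove only the first one it finds, which is the injected one because it is inserted right after the opening <html>.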

See this article: http://devzone.zend.com/1538/php-dom-xml-extension-encoding-processing/

+7

This was enough for me; the other answers here were overkill for my case. I have an HTML document with an existing HEAD tag. The HEAD tag has no attributes, and for my use case I had no problem leaving the extra META tag in the HTML.

    $data = str_ireplace(
        '<head>',
        '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" />',
        $data
    );
    $document = new DOMDocument();
    $document->loadHTML($data);
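The question deals with a fragment that has no <head> to patch. A similar idea might work there by prepending the meta tag to the fragment and then serializing only the children of <body>, which avoids stripping the DOCTYPE, html and body wrappers with string replacement. This is a sketch under those assumptions, not a method taken from the answers: the helper name transformFragment is hypothetical, and it relies on DOMDocument::saveHTML() accepting a node argument (PHP 5.3.6 or later).

```php
<?php
// Hypothetical helper for HTML fragments; not part of the answers above.
function transformFragment(string $fragment): string
{
    // Assumption: $fragment is UTF-8. The meta tag tells the parser so.
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($meta . $fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // ... modify the DOM here ...

    // Serialize only the children of <body>; the injected meta lives in <head>
    // and is therefore left out of the result.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }

    return $out;
}

$input = '<p>Sans doute parce qu’il vient d’atteindre une date déterminante</p>';
$output = transformFragment($input);
echo $output, PHP_EOL;
```

Note that a fragment beginning with bare text rather than an element may be wrapped in a <p> by the parser, so the output is not guaranteed to be byte-identical to the input.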
+4