DOMDocument seems to convert Chinese characters to codes [...]. How can I save text in Chinese or another foreign language as it is, instead of having it converted into codes?
$dom = new DOMDocument(); $dom->loadHTML($html);
You are using the loadHTML function to load an HTML fragment. By default, DOMDocument expects that string to be in HTML's default encoding (ISO-8859-1). Most often, however, the charset (sic!) is meta-information supplied alongside the string you use, not inside it. To make matters more complicated, that meta-information can also be inside the string itself.
In any case, since you did not share the HTML string data and you did not specify the encoding, it is difficult to say exactly what is happening.
I assume that the HTML is encoded in UTF-8, but that this is not signaled inside the HTML string. So the following workaround may help:
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html); // dirty fix
foreach ($doc->childNodes as $item)
    if ($item->nodeType == XML_PI_NODE)
        $doc->removeChild($item); // remove hack
$doc->encoding = 'UTF-8'; // insert proper
It injects an encoding hint at the very beginning of the string (and removes it again once the HTML has been loaded). From then on, DOMDocument returns UTF-8, as it always does.
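For reuse, the workaround can be wrapped in a small function. The following is only a sketch under assumptions: the helper name loadUtf8Html and the sample string are hypothetical and not part of the original answer, the source file is assumed to be saved as UTF-8, and the input is assumed to really be UTF-8. It copies the child list before removing the hint so that removal does not interfere with iteration, and it temporarily silences libxml warnings for imperfect markup.

```php
<?php
// Hypothetical helper (name is an assumption, not from the original answer).
// Loads a UTF-8 HTML string by prepending an encoding hint, then removes the hint.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html); // encoding hint
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list first, then drop any processing instruction
    // that the hint left at the top level of the document.
    foreach (iterator_to_array($doc->childNodes) as $item) {
        if ($item->nodeType === XML_PI_NODE) {
            $doc->removeChild($item);
        }
    }

    $doc->encoding = 'UTF-8'; // serialize as UTF-8 from now on
    return $doc;
}

// Usage with an assumed UTF-8 sample fragment.
$html = '<p>中文</p>';
$doc  = loadUtf8Html($html);
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === '中文'); // compares the text read back with the input
```

If the HTML is actually in some other encoding, the hint would need to name that encoding instead, or the string would need to be converted to UTF-8 first.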
hakre May 31 '12 at 13:50