How to save Chinese or other foreign-language characters as they are, instead of turning them into codes?

DOMDocument seems to convert Chinese characters to codes, like

你 的 乱发 will become ä½ çš„ä¹±å'

How can I save Chinese or other foreign-language text as it is, instead of having it converted to codes?

Below is my simple test:

 $dom = new DOMDocument(); $dom->loadHTML($html); 

If I add the line below before loadHTML(),

 $html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8"); 

I get

 &#20320;&#30340;&#20081;&#21457; 

Although these codes are displayed as Chinese characters, &#20320;&#30340;&#20081;&#21457; is still not the actual 你的乱发 that I am after ....

+4
php domdocument cjk
Apr 19 '12 at 21:41
3 answers

DOMDocument seems to convert Chinese characters to codes [...]. How can I save Chinese or another foreign language as it is, instead of converting them into codes?

 $dom = new DOMDocument(); $dom->loadHTML($html); 

You are using the loadHTML function to load an HTML fragment. By default, DOMDocument expects the string to be in the default HTML encoding (ISO-8859-1). Most often, however, the charset (sic!) is meta-information provided alongside the string you use, not inside it. To complicate matters, that meta-information can also be inside the string.

In any case, since you did not share the HTML string data and you did not specify the encoding, it is difficult to say exactly what is happening.

I assume that the HTML is encoded in UTF-8, but that this is not signaled inside the HTML string. So the following workaround may help:

 $doc = new DOMDocument();
 $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

 // dirty fix
 foreach ($doc->childNodes as $item)
     if ($item->nodeType == XML_PI_NODE)
         $doc->removeChild($item); // remove hack
 $doc->encoding = 'UTF-8'; // insert proper

It introduces an encoding hint at the very beginning (and removes it again after the HTML has been loaded). From then on, DOMDocument returns UTF-8 (as it always does).
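
For anyone who wants to try the workaround above in isolation, here is a minimal sketch that wraps it in a function and reads the text back. The helper name loadUtf8Html and the sample markup are my own assumptions, not part of the answer, and the libxml error suppression is only there to keep parser warnings out of the output.

```php
<?php
// Hypothetical helper (the name is not from the answer): loads a UTF-8 HTML
// fragment using the encoding-hint hack and removes the hint afterwards.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove the processing instruction that carried the hint, if present.
    foreach ($doc->childNodes as $item) {
        if ($item->nodeType === XML_PI_NODE) {
            $doc->removeChild($item);
            break; // stop iterating once the list has been modified
        }
    }

    $doc->encoding = 'UTF-8';
    return $doc;
}

// Sample input, assumed to be UTF-8 (this file must be saved as UTF-8 too).
$doc  = loadUtf8Html('<p>你的乱发</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

echo $text, PHP_EOL;
```

If the hint was picked up, $text should hold the original characters rather than mojibake or numeric entities.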

+8
May 31 '12 at 13:50

I just stumbled upon this thread while looking for a solution to a similar problem. After loading the HTML correctly and doing some parsing with XPath etc., my text ended up as follows:

 &#20320;&#30340;&#20081;&#21457; 

This displays fine in HTML text, but it will not display correctly inside a style or script tag (for example, when setting Chinese font names).

To fix this, reverse the conversion lauthiamkok used:

 $html = mb_convert_encoding($html, "UTF-8", "HTML-ENTITIES"); 

If for some reason the first workaround does not work for you, try this conversion.
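
A possible alternative, in case the conversion above triggers a deprecation notice: the HTML-ENTITIES pseudo-encoding of mbstring is deprecated as of PHP 8.2, and html_entity_decode() can do the same reversal. This is only a sketch with a made-up variable holding the entity string from this thread. Note that html_entity_decode() also decodes named entities such as &amp;amp; and &amp;lt;, so it is safer to apply it to extracted text than to a whole HTML document.

```php
<?php
// Numeric entities as they appear in this thread.
$encoded = '&#20320;&#30340;&#20081;&#21457;';

// Decode the entities back into real UTF-8 characters.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL;
```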

+2
Sep 14 '12 at 4:21

I am sure that ä½ çš„ä¹±å' is actually Windows Latin 1 (not ASCII, there are no diacritics in ASCII). Somewhere along the way, your UTF-8 text was saved as Windows Latin 1 ....
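
To illustrate how such a string comes about, here is a small sketch that reinterprets the UTF-8 bytes of 你 as a single-byte encoding. It uses ISO-8859-1 instead of Windows Latin 1, which is my own substitution: the two agree on the three bytes involved here, and ISO-8859-1 maps every byte, so the round trip is lossless. The variable names are made up for the example, and the mbstring extension is assumed to be available.

```php
<?php
// UTF-8 bytes of the character 你 (U+4F60): E4 BD A0.
$original = "\xE4\xBD\xA0";

// Pretend those three bytes are ISO-8859-1 and re-encode them as UTF-8.
// Each byte becomes its own character: a-umlaut, one-half sign, no-break space.
$mojibake = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Reversing the mistake restores the original bytes.
$restored = mb_convert_encoding($mojibake, 'ISO-8859-1', 'UTF-8');

echo $mojibake, PHP_EOL;
echo $restored, PHP_EOL;
```

The first line printed starts like the garbled text in the question, the second is the original character again.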

0
May 21 '12 at 12:47