PHP, MySQL and XML = garbled HTML output

I have a field in MySQL text type using the following sort: utf8_general_ci .

This XML field is populated using a variable built using DOMDocument:

 function ed_audit_node($dom, $field, $new, $old){ //create audit_detail node $ad = $dom->createElement('audit_detail'); $fn = $dom->createElement('fieldname'); $fn->appendChild($dom->createTextNode($field)); $ad->appendChild($fn); $ov = $dom->createElement('old_value'); $ov->appendChild($dom->createTextNode($old)); $ad->appendChild($ov); $nv = $dom->createElement('new_value'); $nv->appendChild($dom->createTextNode($new)); $ad->appendChild($nv); //append to document return $ad; } 

Here is how I can save on db ($ xml comes from $ dom-> saveXML ()):

 function ed_audit_insert($ed, $xml){ global $visitor; $sql = <<<EOF INSERT INTO ed.audit (employee_id, audit_date, audit_action, audit_data, user_id) VALUES ( {$ed[emp][employee_id]}, now(), '{$ed[audit_action]}', '{$xml}', {$visitor[user_id]} ); EOF; $req = mysql_query($sql,$ed['db']) or die(db_query_error($sql,mysql_error(),__FUNCTION__)); //snip } 

See the old, parallel, slightly related thread on how Im creates this XML: Another PHP XML parsing error: "The input is not valid UTF-8, specify the encoding!"

What works : - querying the database, selecting a field and displaying it using jQuery (.ajax ()) and filling in a text field. Firebug and textarea correspond to what is in the database (confirmed by Toad).

What does not work : - text output from the database to the HTML page. This HTML page has an ISO-8859-1 content type that I cannot change.

Here is the code that displays this:

 $xmlData = simplexml_load_string($d['audit_data']); foreach ($xmlData->audit_detail as $a){ echo "<p> straight from db = ".$a->new_value."</p>"; echo "<p> utf8_decode() = ".utf8_decode($a->new_value)."</p>"; } 

Ive also used the Charset Extender for Firefox: tried unsuccessfully to use ISO-8859-1, UTF-8, and 1252.

If it was UTF-8, shouldn't I see diamonds with question marks inside (since this is content-type = ISO-8859-1)? If it is not UTF-8, what is it?

Edit # 1

Here's a snapshot of the other tests I did:

 $xmlData = simplexml_load_string($d['audit_data']); foreach ($xmlData->audit_detail as $a){ echo "<p>encoding is, straight from db, using mb_detect_encoding: ".mb_detect_encoding($a->new_value)."</p>"; echo "<p>encoding is, with utf8_decode, using mb_detect_encoding: ".mb_detect_encoding(utf8_decode($a->new_value))."</p>"; echo "<hr/>"; echo "<p> straight from db = <pre>".$a->new_value."</pre></p>"; echo "<p> utf8_decode() = <pre>".utf8_decode($a->new_value)."</pre></p>"; echo "<hr/>"; $iso88591_2 = iconv('UTF-8', 'ISO-8859-1', $a->new_value); $iso88591_3 = mb_convert_encoding($a->new_value, 'ISO-8859-1', 'UTF-8'); echo "<p> iconv() = ".$iso88591_2."</p>"; echo "<p> mb_convert_encoding() = ".$iso88591_3."</p>"; } 

Edit # 2

I added my own FF tag, xmp.

the code:

 $xmlData = simplexml_load_string($d['audit_data']); foreach ($xmlData->audit_detail as $a){ echo "<p>encoding is, straight from db, using mb_detect_encoding: ".mb_detect_encoding($a->new_value)."</p>"; echo "<p>encoding is, with utf8_decode, using mb_detect_encoding: ".mb_detect_encoding(utf8_decode($a->new_value))."</p>"; echo "<hr/>"; echo "<p> straight from db = <pre>".$a->new_value."</pre></p>"; echo "<p> utf8_decode() = <pre>".utf8_decode($a->new_value)."</pre></p>"; echo "<hr/>"; $iso88591_2 = iconv('UTF-8', 'ISO-8859-1', $a->new_value); $iso88591_3 = mb_convert_encoding($a->new_value, 'ISO-8859-1', 'UTF-8'); echo "<p> iconv() = ".$iso88591_2."</p>"; echo "<p> mb_convert_encoding() = ".$iso88591_3."</p>"; echo "<hr/>"; echo "<p>straight from db, using &lt;xmp&gt; = <xmp>".$a->new_value."</xmp></p>"; echo "<p>utf8_decode(), using &lt;xmp&gt; = <xmp>".utf8_decode($a->new_value)."</xmp></p>"; } 

Here are some meta tags from the page:

 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> <meta name="dc.language" scheme="ISO639-2/T" content="eng" /> 

IMO, the last meta tag has no support.

Edit # 3

Source:

 <p>encoding is, straight from db, using mb_detect_encoding: UTF-8</p><p>encoding is, with utf8_decode, using mb_detect_encoding: ASCII</p><hr/><p> straight from db = <pre>Ro马eç ³é ¥n franê¡©s</pre></p><p> utf8_decode() = <pre>Ro?e??n fran?s</pre></p><hr/><p> iconv() = Ro</p><p> mb_convert_encoding() = Ro?e??n fran?s</p><hr/><p>straight from db, using &lt;xmp&gt; = <xmp>Ro马eç ³é ¥n franê¡©s</xmp></p><p>utf8_decode(), using &lt;xmp&gt; = <xmp>Ro?e??n fran?s</xmp></p> 

Change # 4

Here is the SQL statement going to db:

 INSERT INTO ed.audit (employee_id, audit_date, audit_action, audit_data, user_id) VALUES ( 75, now(), 'u', '<?xml version="1.0"?> <audit><audit_detail><fieldname>role_fra</fieldname><old_value>aRo&#x9A6C;e&#x7833;&#x9825;n fran&#xA869;s</old_value><new_value>bRo&#x9A6C;e&#x7833;&#x9825;n fran&#xA869;s</new_value></audit_detail></audit> ', 333 ); 

! Please note: text from this XML does not necessarily match the screenshots above.

Change # 5

Here is my new function that wraps the CDATA tag around my values ​​for the old_value and new_value nodes:

 function ed_audit_node($dom, $field, $new, $old){ //create audit_detail node $ad = $dom->createElement('audit_detail'); $fn = $dom->createElement('fieldname'); $fn->appendChild($dom->createTextNode($field)); $ad->appendChild($fn); $ov = $dom->createElement('old_value'); $ov->appendChild($dom->createCDATASection($old)); $ad->appendChild($ov); $nv = $dom->createElement('new_value'); $nv->appendChild($dom->createCDATASection($new)); $ad->appendChild($nv); //append to document return $ad; } 

I also added the encoding to the XML document:

 $dom = new DomDocument('1.0', 'UTF-8'); 

Here is my new simpleXML call:

 $xmlData = simplexml_load_string($d['audit_data'], "SimpleXMLElement", LIBXML_NOENT | LIBXML_NOCDATA); 

I see CDATA tags in Toad. However, I still get the error message:

 Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 2: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xE9 0xE9 0x6C 0x65 in <snip> 

Change # 6

I just noticed that a jQuery call returns the correct accented characters in CDATA.

+6
xml php utf-8 character-encoding
source share
1 answer

Technically, your string is in UTF8, but characters encoded in HTML (when rendered by the browser) are not in UTF8. So &#xa869; is a valid UTF8 string, but the character displayed on the screen from a web browser is not valid UTF8.

I would also wrap your echo on the screen (last 2 lines in your example) as follows:

 echo "<p>straight from db = <xmp>".$a->new_value."</xmp></p>"; echo "<p>utf8_decode() = <xmp>".utf8_decode($a->new_value)."</xmp></p>"; 

This will clearly display the point I am doing above.

Edit:

The problem actually is the uncontrolled undocumented "Feature" in simplexml_load_string () for PHP. It automatically converts all characters from its XML entity form format to their actual char form. The only way around this is to use simplexml_load_string () as follows:

  $data = simplexml_load_string( '<?xml version="1.0" encoding="utf-8"?> <audit> <audit_detail> <fieldname>role_fra</fieldname> <old_value><![CDATA[aRo&#x9A6C;e&#x7833;&#x9825;n fran&#xA869;s]]></old_value> <new_value><![CDATA[bRo&#x9A6C;e&#x7833;&#x9825;n fran&#xA869;s]]></new_value> </audit_detail> </audit>', "SimpleXMLElement", LIBXML_NOENT | LIBXML_NOCDATA ); print "<PRE>"; print_r($data); exit; 

You must wrap your elements in the <![CDATA[]]> tags, and then pass the LIBXML_NOCDATA parameter to an XML parser. This will cause things in the <![CDATA[]]> tags to be different as a String type, and PHP can correctly handle this outside of the SimpleXMLObject.

+1
source share

All Articles