I have a field in MySQL text type using the following sort: utf8_general_ci .
This XML field is populated using a variable built using DOMDocument:
function ed_audit_node($dom, $field, $new, $old){ //create audit_detail node $ad = $dom->createElement('audit_detail'); $fn = $dom->createElement('fieldname'); $fn->appendChild($dom->createTextNode($field)); $ad->appendChild($fn); $ov = $dom->createElement('old_value'); $ov->appendChild($dom->createTextNode($old)); $ad->appendChild($ov); $nv = $dom->createElement('new_value'); $nv->appendChild($dom->createTextNode($new)); $ad->appendChild($nv); //append to document return $ad; }
Here is how I can save on db ($ xml comes from $ dom-> saveXML ()):
function ed_audit_insert($ed, $xml){ global $visitor; $sql = <<<EOF INSERT INTO ed.audit (employee_id, audit_date, audit_action, audit_data, user_id) VALUES ( {$ed[emp][employee_id]}, now(), '{$ed[audit_action]}', '{$xml}', {$visitor[user_id]} ); EOF; $req = mysql_query($sql,$ed['db']) or die(db_query_error($sql,mysql_error(),__FUNCTION__));
See the old, parallel, slightly related thread on how Im creates this XML: Another PHP XML parsing error: "The input is not valid UTF-8, specify the encoding!"
What works : - querying the database, selecting a field and displaying it using jQuery (.ajax ()) and filling in a text field. Firebug and textarea correspond to what is in the database (confirmed by Toad).
What does not work : - text output from the database to the HTML page. This HTML page has an ISO-8859-1 content type that I cannot change.
Here is the code that displays this:
$xmlData = simplexml_load_string($d['audit_data']); foreach ($xmlData->audit_detail as $a){ echo "<p> straight from db = ".$a->new_value."</p>"; echo "<p> utf8_decode() = ".utf8_decode($a->new_value)."</p>"; }
Ive also used the Charset Extender for Firefox: tried unsuccessfully to use ISO-8859-1, UTF-8, and 1252.
If it was UTF-8, shouldn't I see diamonds with question marks inside (since this is content-type = ISO-8859-1)? If it is not UTF-8, what is it?
Edit # 1
Here's a snapshot of the other tests I did:
$xmlData = simplexml_load_string($d['audit_data']); foreach ($xmlData->audit_detail as $a){ echo "<p>encoding is, straight from db, using mb_detect_encoding: ".mb_detect_encoding($a->new_value)."</p>"; echo "<p>encoding is, with utf8_decode, using mb_detect_encoding: ".mb_detect_encoding(utf8_decode($a->new_value))."</p>"; echo "<hr/>"; echo "<p> straight from db = <pre>".$a->new_value."</pre></p>"; echo "<p> utf8_decode() = <pre>".utf8_decode($a->new_value)."</pre></p>"; echo "<hr/>"; $iso88591_2 = iconv('UTF-8', 'ISO-8859-1', $a->new_value); $iso88591_3 = mb_convert_encoding($a->new_value, 'ISO-8859-1', 'UTF-8'); echo "<p> iconv() = ".$iso88591_2."</p>"; echo "<p> mb_convert_encoding() = ".$iso88591_3."</p>"; }
Edit # 2
I added my own FF tag, xmp.
the code:
$xmlData = simplexml_load_string($d['audit_data']); foreach ($xmlData->audit_detail as $a){ echo "<p>encoding is, straight from db, using mb_detect_encoding: ".mb_detect_encoding($a->new_value)."</p>"; echo "<p>encoding is, with utf8_decode, using mb_detect_encoding: ".mb_detect_encoding(utf8_decode($a->new_value))."</p>"; echo "<hr/>"; echo "<p> straight from db = <pre>".$a->new_value."</pre></p>"; echo "<p> utf8_decode() = <pre>".utf8_decode($a->new_value)."</pre></p>"; echo "<hr/>"; $iso88591_2 = iconv('UTF-8', 'ISO-8859-1', $a->new_value); $iso88591_3 = mb_convert_encoding($a->new_value, 'ISO-8859-1', 'UTF-8'); echo "<p> iconv() = ".$iso88591_2."</p>"; echo "<p> mb_convert_encoding() = ".$iso88591_3."</p>"; echo "<hr/>"; echo "<p>straight from db, using <xmp> = <xmp>".$a->new_value."</xmp></p>"; echo "<p>utf8_decode(), using <xmp> = <xmp>".utf8_decode($a->new_value)."</xmp></p>"; }
Here are some meta tags from the page:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> <meta name="dc.language" scheme="ISO639-2/T" content="eng" />
IMO, the last meta tag has no support.
Edit # 3
Source:
<p>encoding is, straight from db, using mb_detect_encoding: UTF-8</p><p>encoding is, with utf8_decode, using mb_detect_encoding: ASCII</p><hr/><p> straight from db = <pre>Ro马eç ³é ¥n franê¡©s</pre></p><p> utf8_decode() = <pre>Ro?e??n fran?s</pre></p><hr/><p> iconv() = Ro</p><p> mb_convert_encoding() = Ro?e??n fran?s</p><hr/><p>straight from db, using <xmp> = <xmp>Ro马eç ³é ¥n franê¡©s</xmp></p><p>utf8_decode(), using <xmp> = <xmp>Ro?e??n fran?s</xmp></p>
Change # 4
Here is the SQL statement going to db:
INSERT INTO ed.audit (employee_id, audit_date, audit_action, audit_data, user_id) VALUES ( 75, now(), 'u', '<?xml version="1.0"?> <audit><audit_detail><fieldname>role_fra</fieldname><old_value>aRo马e砳頥n franꡩs</old_value><new_value>bRo马e砳頥n franꡩs</new_value></audit_detail></audit> ', 333 );
! Please note: text from this XML does not necessarily match the screenshots above.
Change # 5
Here is my new function that wraps the CDATA tag around my values for the old_value and new_value nodes:
function ed_audit_node($dom, $field, $new, $old){ //create audit_detail node $ad = $dom->createElement('audit_detail'); $fn = $dom->createElement('fieldname'); $fn->appendChild($dom->createTextNode($field)); $ad->appendChild($fn); $ov = $dom->createElement('old_value'); $ov->appendChild($dom->createCDATASection($old)); $ad->appendChild($ov); $nv = $dom->createElement('new_value'); $nv->appendChild($dom->createCDATASection($new)); $ad->appendChild($nv); //append to document return $ad; }
I also added the encoding to the XML document:
$dom = new DomDocument('1.0', 'UTF-8');
Here is my new simpleXML call:
$xmlData = simplexml_load_string($d['audit_data'], "SimpleXMLElement", LIBXML_NOENT | LIBXML_NOCDATA);
I see CDATA tags in Toad. However, I still get the error message:
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 2: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xE9 0xE9 0x6C 0x65 in <snip>
Change # 6
I just noticed that a jQuery call returns the correct accented characters in CDATA.