In some XML files that I process (often RSS), I see text such as Today's Newest, which becomes Today’s Newest after I extract the text from the node. This suggests that I am not handling the character encoding properly.
I could just patch my script for this one character, but what if many other characters get garbled as well? What is the proper way to read XML files so that the encoding is not mangled when the content is converted to UTF-8?
Here are some of the things I've tried that don't seem to work:
    $xml = file_get_contents($file);
    // One: still contains ’
    //$xml = @iconv('UTF-8', 'UTF-8//IGNORE', $xml);
    // Two: LibXMLError Entity 'rsquo' not defined
    //$xml = htmlentities($xml, null, 'UTF-8');
    //$xml = htmlspecialchars_decode($xml, ENT_QUOTES);
    // Three: still contains ’
    //$xml = mb_convert_encoding($xml, "UTF-8", "UTF-8");
    $xml = simplexml_load_string($xml, null, LIBXML_NOCDATA | LIBXML_NOENT);
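For what it's worth, the garbled sequence looks like the UTF-8 bytes of a right single quote (U+2019) being read as Windows-1252 somewhere along the way. This is only my guess at the cause, not something I have confirmed in my pipeline. A minimal sketch that reproduces the effect, assuming the mbstring extension is available and using a made-up sample string:

```php
<?php
// Hypothetical sample: the right single quote U+2019 is stored as the
// UTF-8 bytes E2 80 99.
$original = "Today\u{2019}s Newest";

// Assumption: something treats those UTF-8 bytes as Windows-1252 and
// re-encodes them to UTF-8. Each byte then becomes its own character.
$garbled = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

// Reversing the same mistake restores the original bytes.
$repaired = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');

echo bin2hex("\u{2019}"), "\n";  // e28099
echo $garbled, "\n";             // Today’s Newest
echo $repaired, "\n";
```

If this is what is happening, the file itself may be fine and the wrong step would be wherever the bytes get relabelled, for example when the extracted text is written out or displayed.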