The SAX parser in Python 2.6 should be able to parse utf-8 without mangling it. Although you left out the ContentHandler you are using with the parser, if that content handler tries to print any non-ASCII characters to your console, it will crash.
For example, let's say I have this XML document:
    <?xml version="1.0" encoding="utf-8"?>
    <test>
        <name>Champs-Élysées</name>
    </test>
And this parsing code:
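    import xml.sax

    class MyHandler(xml.sax.handler.ContentHandler):

        def startElement(self, name, attrs):
            print "StartElement: %s" % name

        def endElement(self, name):
            print "EndElement: %s" % name

        def characters(self, ch):
            # The print of ch that belongs here is commented out; it is the
            # line that raises the exception on an ASCII-only console.
            pass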
This parses fine, and the content really does preserve the accented characters from the XML. The only problem is the line in def characters() that I commented out. Run from the console in Python 2.6, that line gives you the exception you are seeing, because the print statement must convert the characters to ASCII for output.
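To check that the parser itself is not the problem, here is a minimal sketch that collects the character data instead of printing it. The CollectingHandler class and the inline XML string are assumptions for illustration, not part of the original code, and the snippet is written so that it should run on both Python 2.6+ and Python 3.

```python
import xml.sax
import xml.sax.handler

# Same document as above, built as UTF-8 bytes (inline here only for illustration).
XML = (u'<?xml version="1.0" encoding="utf-8"?>'
       u'<test><name>Champs-\u00c9lys\u00e9es</name></test>').encode('utf-8')

class CollectingHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler that stores character data instead of printing it."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, ch):
        # ch arrives as a unicode string; no console encoding is involved here.
        self.chunks.append(ch)

handler = CollectingHandler()
xml.sax.parseString(XML, handler)
text = u"".join(handler.chunks)

# repr() is ASCII-safe on Python 2, so this does not depend on the terminal.
print(repr(text))
```

If the collected text still contains the accented characters, the parse is fine and only the output step needs one of the fixes below.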
You have 3 possible solutions:
One: Make sure your terminal supports Unicode, then create a sitecustomize.py file in site-packages and set the default encoding to utf-8:

    import sys
    sys.setdefaultencoding('UTF-8')
Two: Do not print the output to the terminal (tongue in cheek).
Three: Normalize the output with unicodedata.normalize to convert non-ASCII characters to their ASCII equivalents, or encode the characters to ASCII for text output: ch.encode('ascii', 'replace'). Of course, with this method you will not see the text exactly as it was written.
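As a rough sketch of option three, the two approaches could be wrapped in one helper. The name to_ascii and its strip_accents flag are hypothetical, and the choice of the NFKD form is an assumption; the snippet should run on Python 2.6+ as well as Python 3, where the result is a bytes object.

```python
import unicodedata

def to_ascii(text, strip_accents=True):
    """Return an ASCII-only byte string for a unicode string (hypothetical helper)."""
    if strip_accents:
        # NFKD splits an accented letter into a base letter plus a combining
        # mark; encoding with 'ignore' then drops the mark and keeps the letter.
        decomposed = unicodedata.normalize('NFKD', text)
        return decomposed.encode('ascii', 'ignore')
    # Without normalization, every non-ASCII character is replaced by "?".
    return text.encode('ascii', 'replace')

name = u"Champs-\u00c9lys\u00e9es"
print(to_ascii(name))
print(to_ascii(name, strip_accents=False))
```

Either way the output is lossy: the first form drops the accents, the second replaces the accented letters entirely.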
Using option one above, your code worked fine for me in Python 2.5.
Jarret hardie