Setting the encoding for the sax parser in Python

When I pass UTF-8 encoded XML to an ExpatParser instance:

    import codecs
    import xml.sax

    def test(filename):
        parser = xml.sax.make_parser()
        # codecs.open decodes the file, so each line is a unicode object
        with codecs.open(filename, 'r', encoding='utf-8') as f:
            for line in f:
                parser.feed(line)

... I get the following:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "test.py", line 72, in search_test
        parser.feed(line)
      File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
        self._parser.Parse(data, isFinal)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 29: ordinal not in range(128)

I'm probably missing something obvious. How do I change the parser encoding from 'ascii' to 'utf-8'?

+6
python unicode sax
5 answers

Your code does not work in Python 2.6, but works in version 3.0.

This works in version 2.6, apparently because it lets the parser determine the encoding itself (perhaps by reading the encoding optionally declared on the first line of the XML file, and otherwise defaulting to UTF-8):

    import xml.sax

    def test(filename):
        parser = xml.sax.make_parser()
        # Hand the parser the undecoded file so it can detect the encoding
        parser.parse(open(filename))

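As a minimal sketch of the same idea, the snippet below hands the parser raw bytes and lets it read the encoding declaration on its own. It is written for Python 3, whereas the thread targets Python 2, and the `TextCollector` handler and `parse_bytes` helper are hypothetical names introduced only for this illustration.

```python
import io
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that accumulates character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so collect them all
        self.chunks.append(content)


def parse_bytes(data):
    """Parse undecoded XML bytes; the parser works out the encoding itself."""
    handler = TextCollector()
    xml.sax.parse(io.BytesIO(data), handler)
    return "".join(handler.chunks)


# Sample document (an assumption for this sketch), encoded as UTF-8 bytes
doc = '<?xml version="1.0" encoding="utf-8"?><name>Champs-Élysées</name>'.encode("utf-8")
text = parse_bytes(doc)
```

The point of the design is that no decoding happens before the parser sees the data, so there is nothing for an implicit ASCII conversion to trip over.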
+5

The SAX parser in Python 2.6 should be able to parse UTF-8 without mangling it. You left out the ContentHandler you are using with the parser, but if that content handler tries to print any non-ASCII characters to your console, it will crash.

For example, let's say I have this XML document:

    <?xml version="1.0" encoding="utf-8"?>
    <test>
      <name>Champs-Élysées</name>
    </test>

And this parsing code:

    import xml.sax

    class MyHandler(xml.sax.handler.ContentHandler):
        def startElement(self, name, attrs):
            print "StartElement: %s" % name

        def endElement(self, name):
            print "EndElement: %s" % name

        def characters(self, ch):
            #print "Characters: '%s'" % ch
            pass

    parser = xml.sax.make_parser()
    parser.setContentHandler(MyHandler())

    for line in open('text.xml', 'r'):
        parser.feed(line)

This parses fine, and the content does preserve the accented characters from the XML. The only problem is the line in def characters() that I commented out. Run from the console in Python 2.6, that line raises the exception you are seeing, because print has to convert the characters to ASCII for output.
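The failing step can be reproduced in isolation. The following is a small sketch, assuming Python 3 syntax rather than the Python 2 used in this thread, with `try_ascii` as a hypothetical helper name: encoding a non-ASCII character with the 'ascii' codec raises the same kind of UnicodeEncodeError as in the traceback.

```python
def try_ascii(text):
    """Return the ASCII bytes for text, or the error message if encoding fails."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as exc:
        # The same class of error as in the traceback from the question
        return str(exc)


# u'\xb4' is the character named in the original error message
result = try_ascii("\xb4")
plain = try_ascii("abc")
```

Here `plain` holds ordinary bytes, while `result` holds an error message, which suggests the parser itself is not at fault: the conversion to ASCII is.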

You have 3 possible solutions:

One: make sure your terminal supports Unicode, then create a sitecustomize.py file in site-packages that sets the default encoding to UTF-8:

    import sys
    sys.setdefaultencoding('UTF-8')

Two: do not print the output to the terminal (tongue in cheek).

Three: normalize the output with unicodedata.normalize to convert non-ASCII characters to ASCII equivalents, or encode the characters to ASCII for text output: ch.encode('ascii', 'replace'). Of course, with this method you lose the original characters, so you will not be able to evaluate the text properly.
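Option three can be sketched as follows. This is an illustration in Python 3 syntax (the thread itself is Python 2), and the helper names `to_ascii_stripped` and `to_ascii_replaced` are made up for the example. The first helper decomposes accented letters and drops the accents; the second substitutes a question mark for anything outside ASCII.

```python
import unicodedata


def to_ascii_stripped(text):
    """Decompose accented characters, then drop whatever is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def to_ascii_replaced(text):
    """Replace every non-ASCII character with '?'."""
    return text.encode("ascii", "replace").decode("ascii")


stripped = to_ascii_stripped("Champs-Élysées")
replaced = to_ascii_replaced("Champs-Élysées")
```

Either way the output is lossy: the first form keeps the text readable but changes its spelling, and the second makes the loss visible.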

Using option one above, your code worked fine for me in Python 2.5.

+5

Yarret Hardy has already explained the problem. But for those of you who are coding for the command line and don't seem to have sys.setdefaultencoding available, here is a quick workaround for this bug (or "feature"):

    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

Hopefully reload(sys) doesn't break anything else.

More details in this old blog post:

Sign setdefaultencoding

+5

To specify a custom file encoding for SAX parsing, you can use InputSource as follows:

    import xml.sax

    def test(filename, encoding):
        parser = xml.sax.make_parser()
        # Open in binary mode so the parser receives undecoded bytes
        with open(filename, "rb") as f:
            input_source = xml.sax.xmlreader.InputSource()
            input_source.setByteStream(f)
            input_source.setEncoding(encoding)
            parser.parse(input_source)

This lets you parse an XML file whose encoding is neither ASCII nor UTF-8. For example, you can parse an extended-ASCII file encoded as Latin-1 with: test(filename, "latin1")
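A self-contained variant of the same approach, using an in-memory byte stream instead of a file, might look like the sketch below. It assumes Python 3, and `TextCollector` and `parse_with_encoding` are hypothetical names used only here. The sample document has no encoding declaration, so the encoding set on the InputSource is what tells the parser how to read the bytes; the sketch uses the name "iso-8859-1" for Latin-1.

```python
import io
import xml.sax
from xml.sax.xmlreader import InputSource


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that accumulates character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_with_encoding(data, encoding):
    """Parse XML bytes, telling the parser explicitly which encoding to use."""
    source = InputSource()
    source.setByteStream(io.BytesIO(data))
    source.setEncoding(encoding)

    handler = TextCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(source)
    return "".join(handler.chunks)


# Latin-1 bytes with no XML declaration (sample data, an assumption for this sketch)
latin1_doc = "<name>café</name>".encode("iso-8859-1")
text = parse_with_encoding(latin1_doc, "iso-8859-1")
```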

(I added this answer to directly address the title of this question, as it tends to rank highly in search engines.)

+3

Commenting on janpf's answer (sorry, I don't have enough reputation to comment there): note that janpf's version will break IDLE, which requires its own stdout etc., different from the sys defaults. I therefore suggest changing the code to something like:

    import sys
    currentStdOut = sys.stdout
    currentStdIn = sys.stdin
    currentStdErr = sys.stderr
    reload(sys)
    sys.setdefaultencoding('utf-8')
    sys.stdout = currentStdOut
    sys.stdin = currentStdIn
    sys.stderr = currentStdErr

There may be other variables to save, but these seem to be the most important.

0