XML file parsing gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

I submit a GET request to the CareerBuilder API:

    import requests

    url = "http://api.careerbuilder.com/v1/jobsearch"
    payload = {'DeveloperKey': 'MY_DEVLOPER_KEY', 'JobTitle': 'Biologist'}
    r = requests.get(url, params=payload)
    xml = r.text

And get back XML that looks like this. However, I cannot parse it.

Using lxml:

    >>> from lxml import etree
    >>> print etree.fromstring(xml)
    Traceback (most recent call last):
      File "<pyshell#4>", line 1, in <module>
        print etree.fromstring(xml)
      File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
      File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
    ValueError: Unicode strings with encoding declaration are not supported.

or ElementTree:

    Traceback (most recent call last):
      File "<pyshell#3>", line 1, in <module>
        print ET.fromstring(xml)
      File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
        parser.feed(text)
      File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
        self._parser.Parse(data, 0)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)

So, although the XML file starts with

 <?xml version="1.0" encoding="UTF-8"?> 

I get the impression that it contains characters that are not allowed. How can I parse this file using lxml or ElementTree?

2 answers

You are using the decoded Unicode value. Use the raw response data from r.raw instead:

    r = requests.get(url, params=payload, stream=True)
    r.raw.decode_content = True
    etree.parse(r.raw)

This reads the data directly from the response; note the stream=True option passed to .get().

Setting the r.raw.decode_content = True flag ensures that the raw socket gives you the decompressed content, even if the response is gzip or deflate compressed.
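The effect of that flag can be illustrated without a network call. The following is only a minimal sketch under stated assumptions: it uses Python 3 and the standard library's ElementTree rather than lxml, the XML body is made up, and a gzip.GzipFile object stands in for r.raw with decode_content enabled (it is not what requests uses internally).

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical response body, standing in for what the API might return.
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<jobs><job>Biologist\u00a0II</job></jobs>'
).encode("utf-8")

# What travels over the wire when the server applies gzip compression.
compressed = gzip.compress(xml_bytes)

# Feeding the still-compressed bytes to the parser fails.
try:
    ET.fromstring(compressed)
    compressed_parse_failed = False
except ET.ParseError:
    compressed_parse_failed = True

# A decompressing file-like object plays the role of r.raw
# after decode_content has been set to True (an analogy only).
stream = gzip.GzipFile(fileobj=io.BytesIO(compressed))
root = ET.parse(stream).getroot()
job = root.findtext("job")

print(compressed_parse_failed, repr(job))
```

The parser only needs an object with a read() method that yields the decompressed bytes, which is what the flag arranges for r.raw.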

You don't have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the undecoded response body:

    r = requests.get(url, params=payload)
    xml = etree.fromstring(r.content)

XML parsers always expect bytes as input, because the XML format itself defines how the parser should decode these bytes into Unicode text.
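To make that point concrete, here is a small sketch, assuming Python 3 and the standard library's ElementTree (the same byte strings could also be handed to lxml's etree.fromstring). The documents are invented for illustration. The same text is serialized in two different encodings, and the parser recovers identical Unicode text from each because it follows the encoding declaration inside the bytes.

```python
import xml.etree.ElementTree as ET

# Hypothetical job title containing a non-breaking space (U+00A0),
# the character from the traceback in the question.
title = "Biologist\u00a0II"

# The same document serialized as bytes in two encodings; each one
# declares its own encoding in the XML prolog.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<job><title>" + title + "</title></job>"
).encode("iso-8859-1")
utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<job><title>" + title + "</title></job>"
).encode("utf-8")

# The byte sequences differ on the wire...
assert latin1_doc != utf8_doc

# ...but the parser decodes each according to its declaration.
title_latin1 = ET.fromstring(latin1_doc).findtext("title")
title_utf8 = ET.fromstring(utf8_doc).findtext("title")

print(repr(title_latin1), repr(title_utf8))
```

Decoding the body to text before parsing throws this information away, which is why passing r.content rather than r.text is the safer choice.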


Correction!

See below for how I misunderstood everything. Basically, when we use the .text attribute, the result is a decoded unicode string. Passing it to lxml raises the following exception:

ValueError: Unicode strings with encoding declaration are not supported. Please use byte input or XML fragments without declaration.

Which basically means that @martijn-pieters was right: we should use the raw response bytes returned by .content.

Incorrect answer (but it may be of interest to someone)

For anyone interested: I believe the cause of this error is probably an incorrect guess made by requests, as explained in the Response.text documentation:

Unicode response content.

If Response.encoding is None, the encoding will be guessed using chardet.

The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.

So, following this, one could also make sure that r.text decodes the response content correctly by explicitly setting the encoding with r.encoding = 'UTF-8'.

This approach adds another check that the received response really is in the expected encoding before it is parsed with lxml.
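The kind of wrong guess described in the quoted documentation can be sketched with the standard library alone. This is an illustration under assumptions, not a trace of what requests does internally: the body is invented, Python 3 is assumed, and ISO-8859-1 is used as the header-derived guess because that is the RFC 2616 default for text content served without a charset.

```python
import xml.etree.ElementTree as ET

# Hypothetical UTF-8 response body containing a non-breaking space.
body = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<job><title>Biologist\u00a0II</title></job>"
).encode("utf-8")

# Decoding with a wrong, header-derived guess yields mojibake:
# the two UTF-8 bytes C2 A0 become two separate characters.
header_guess = body.decode("iso-8859-1")

# Decoding with the encoding the document actually uses,
# comparable to setting r.encoding = 'UTF-8' before reading r.text.
explicit = body.decode("utf-8")

# Parsing the bytes directly sidesteps the question altogether.
parsed_title = ET.fromstring(body).findtext("title")

print(repr(header_guess[50:70]))
print(repr(parsed_title))
```

Even with the encoding set correctly, the decoded string still carries the encoding declaration, so lxml would reject it with the ValueError shown earlier; the bytes remain the right input for the parser.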
