How to send Unicode characters using httplib?

I am trying to send Unicode data using the httplib request function:

    s = u"עברית"
    data = """
    <spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
    <text>%s</text>
    </spellrequest>
    """ % s
    con = httplib.HTTPSConnection("www.google.com")
    con.request("POST", "/tbproxy/spell?lang=he", data)
    response = con.getresponse().read()

However, this is the error I get:

    Traceback (most recent call last):
      File "C:\Scripts\iQuality\test.py", line 47, in <module>
        print spellFix(u"╫á╫נ╫¿╫ץ╫ר╫ץ")
      File "C:\Scripts\iQuality\test.py", line 26, in spellFix
        con.request("POST", "/tbproxy/spell?lang=%s" % lang, data)
      File "C:\Python27\lib\httplib.py", line 955, in request
        self._send_request(method, url, body, headers)
      File "C:\Python27\lib\httplib.py", line 989, in _send_request
        self.endheaders(body)
      File "C:\Python27\lib\httplib.py", line 951, in endheaders
        self._send_output(message_body)
      File "C:\Python27\lib\httplib.py", line 815, in _send_output
        self.send(message_body)
      File "C:\Python27\lib\httplib.py", line 787, in send
        self.sock.sendall(data)
      File "C:\Python27\lib\ssl.py", line 220, in sendall
        v = self.send(data[count:])
      File "C:\Python27\lib\ssl.py", line 189, in send
        v = self._sslobj.write(data)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 97-102: ordinal not in range(128)

Where am I mistaken?

python unicode
1 answer

HTTP is not defined in terms of a particular character encoding; it works with octets instead. You need to encode your data, and then tell the server which encoding you used. Let's use UTF-8, since it is usually the best choice.

This data looks a bit like XML, but you are omitting the XML declaration. Some services may accept that, but you should not leave it out in any case. In fact, the declaration is where the encoding is actually stated, so make sure you include it. It looks like <?xml version="1.0" encoding="ENCODING"?>, where ENCODING is the name of the encoding you used.

    s = u"עברית"
    data_unicode = u"""<?xml version="1.0" encoding="UTF-8"?>
    <spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
    <text>%s</text>
    </spellrequest>
    """ % s
    data_octets = data_unicode.encode('utf-8')
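The failure in the traceback can be reproduced without any network access. The sketch below is illustrative only: it encodes explicitly with the ASCII codec, which is assumed to mirror the implicit conversion Python 2 attempts when a unicode body reaches the socket, and then shows that UTF-8 turns the same text into plain octets. The variable names are hypothetical.

```python
# Illustrative sketch; variable names are hypothetical.

# The Hebrew word from the question, written with escapes so the
# source file itself stays pure ASCII.
word = u"\u05e2\u05d1\u05e8\u05d9\u05ea"
body = u"<text>%s</text>" % word

# Encoding with ASCII fails, like the implicit conversion in the traceback.
try:
    body.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding with UTF-8 succeeds and yields octets that can go on the wire.
body_octets = body.encode("utf-8")

print(ascii_ok)          # False
print(len(body))         # 18 characters
print(len(body_octets))  # 23 octets: each Hebrew letter takes two
```

The octet count differs from the character count, which is one more reason to encode once, explicitly, and send exactly those octets.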

As a courtesy, you should also tell the server itself the format and encoding with the content-type header:

    con = httplib.HTTPSConnection("www.google.com")
    con.request("POST",
                "/tbproxy/spell?lang=he",
                data_octets,
                {'content-type': 'text/xml; charset=utf-8'})
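To see why the charset parameter matters, here is a rough sketch of what a receiving side could do with that header. It does not describe how the spell service itself behaves (that is not known here); it only uses the standard library's email.message.Message to parse the header value, and decode_body is a hypothetical helper.

```python
from email.message import Message


def decode_body(content_type, body_octets, default="utf-8"):
    """Decode request octets using the charset named in a Content-Type value.

    Hypothetical helper for illustration; falls back to `default`
    when the header carries no charset parameter.
    """
    msg = Message()
    msg["content-type"] = content_type
    charset = msg.get_content_charset() or default
    return body_octets.decode(charset)


# Octets as the client would send them, then decoded on the other side.
sent = u"<text>\u05e2\u05d1\u05e8\u05d9\u05ea</text>".encode("utf-8")
received = decode_body("text/xml; charset=utf-8", sent)
```

Without the charset parameter the receiver has to guess, which is why stating it is more than a courtesy when the body is not plain ASCII.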

EDIT: It works fine on my machine; are you sure you aren't missing something? Here is a full example:

    >>> from cgi import escape
    >>> from urllib import urlencode
    >>> import httplib
    >>>
    >>> template = u"""<?xml version="1.0" encoding="UTF-8"?>
    ... <spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
    ... <text>%s</text>
    ... </spellrequest>
    ... """
    >>>
    >>> def chkspell(word, lang='en'):
    ...     data_octets = (template % escape(word)).encode('utf-8')
    ...     con = httplib.HTTPSConnection("www.google.com")
    ...     con.request("POST",
    ...                 "/tbproxy/spell?" + urlencode({'lang': lang}),
    ...                 data_octets,
    ...                 {'content-type': 'text/xml; charset=utf-8'})
    ...     req = con.getresponse()
    ...     return req.read()
    ...
    >>> chkspell('baseball')
    '<?xml version="1.0" encoding="UTF-8"?><spellresult error="0" clipped="0" charschecked="8"></spellresult>'
    >>> chkspell(corpus, 'he')
    '<?xml version="1.0" encoding="UTF-8"?><spellresult error="0" clipped="0" charschecked="5"></spellresult>'
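For readers on Python 3, where httplib became http.client and urlencode moved to urllib.parse, the request-building part of the session above could be sketched as follows. This is an adaptation under assumptions, not the answer's original code: build_spell_request is a hypothetical helper, xml.sax.saxutils.escape stands in for cgi.escape, and nothing is sent over the network, so whether the endpoint still responds is not tested here.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

TEMPLATE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<spellrequest textalreadyclipped="0" ignoredups="1" '
    'ignoredigits="1" ignoreallcaps="0">\n'
    '<text>%s</text>\n'
    '</spellrequest>\n'
)


def build_spell_request(word, lang="en"):
    """Return (path, body_octets, headers) for the POST; hypothetical helper."""
    path = "/tbproxy/spell?" + urlencode({"lang": lang})
    # Escape XML metacharacters first, then encode the whole document once.
    body_octets = (TEMPLATE % escape(word)).encode("utf-8")
    headers = {"content-type": "text/xml; charset=utf-8"}
    return path, body_octets, headers


path, body_octets, headers = build_spell_request("\u05e2\u05d1\u05e8\u05d9\u05ea", "he")
```

The three values can then be passed to the request method of an http.client.HTTPSConnection in the same positions as in the Python 2 session: method, path, body, headers.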

I noticed that when I pasted your example, it displayed in the opposite order in my terminal from the way it displays in my browser. That is not too surprising, considering that Hebrew is a right-to-left language.

    >>> corpus = u"עברית"
    >>> print corpus[0]
    ע
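The difference in order is purely a display matter: the string stores its characters in logical (reading) order, and the terminal or browser decides how to lay them out. A small sketch using the standard unicodedata module, with the word written as escapes so the example does not depend on how the source is rendered:

```python
import unicodedata

# Same word as above, in logical order: the first element is the first letter read.
corpus = u"\u05e2\u05d1\u05e8\u05d9\u05ea"
first = corpus[0]

first_name = unicodedata.name(first)           # 'HEBREW LETTER AYIN'
first_bidi = unicodedata.bidirectional(first)  # 'R': strong right-to-left character
```

Because indexing follows logical order, encoding and sending the string is unaffected by how a given terminal happens to draw it.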