HTTP is not defined in terms of a specific character encoding; it works in octets. You need to convert your data to an encoding, and then tell the server which encoding you used. Let's use UTF-8, since it is usually the best choice:
Your data looks a bit like XML, but you are skipping the XML declaration. Some services may accept that, but you should include it anyway. In fact, the encoding is declared there, so make sure you include it. The declaration looks like <?xml version="1.0" encoding="encoding"?>, where encoding is the name of the encoding you used.
s = u"עברית"
data_unicode = u"""<?xml version="1.0" encoding="UTF-8"?>
<spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
<text>%s</text>
</spellrequest>
""" % s
data_octets = data_unicode.encode('utf-8')
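To make the text-versus-octets distinction concrete, here is a minimal sketch. It assumes Python 3 (the answer's own code is Python 2), where `str` holds code points and `bytes` holds octets; the variable names are only illustrative.

```python
# Python 3 sketch: text (code points) versus the octets that go over HTTP.
text = "עברית"                   # the five Hebrew letters from the answer
octets = text.encode("utf-8")    # a bytes object, ready to be sent

print(len(text))                 # number of code points
print(len(octets))               # number of octets (each Hebrew letter takes two in UTF-8)
print(octets.decode("utf-8") == text)  # decoding with the same encoding round-trips
```

The receiver can only reverse the last step if it knows which encoding was used, which is why the encoding has to be declared.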
As a courtesy, you should also tell the server itself the format and encoding with the content-type header:
import httplib

con = httplib.HTTPSConnection("www.google.com")
con.request("POST",
            "/tbproxy/spell?lang=he",
            data_octets,
            {'content-type': 'text/xml; charset=utf-8'})
EDIT: It works fine on my machine; are you sure you aren't missing something? Full example:
>>> from cgi import escape
>>> from urllib import urlencode
>>> import httplib
>>>
>>> template = u"""<?xml version="1.0" encoding="UTF-8"?>
... <spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
... <text>%s</text>
... </spellrequest>
... """
>>>
>>> def chkspell(word, lang='en'):
...     data_octets = (template % escape(word)).encode('utf-8')
...     con = httplib.HTTPSConnection("www.google.com")
...     con.request("POST",
...                 "/tbproxy/spell?" + urlencode({'lang': lang}),
...                 data_octets,
...                 {'content-type': 'text/xml; charset=utf-8'})
...     req = con.getresponse()
...     return req.read()
...
>>> chkspell('baseball')
'<?xml version="1.0" encoding="UTF-8"?><spellresult error="0" clipped="0" charschecked="8"></spellresult>'
>>> chkspell(corpus, 'he')
'<?xml version="1.0" encoding="UTF-8"?><spellresult error="0" clipped="0" charschecked="5"></spellresult>'
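The session above is Python 2. As a rough sketch of how the same request could be prepared in Python 3, the snippet below only builds the path, body and headers and sends nothing, so it does not depend on the service being reachable. `build_spell_request` is a hypothetical helper name, and `xml.sax.saxutils.escape` stands in for `cgi.escape`, which was removed in Python 3.8.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

# Same request template as in the answer, written as a plain str.
TEMPLATE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<spellrequest textalreadyclipped="0" ignoredups="1" '
    'ignoredigits="1" ignoreallcaps="0">\n'
    '<text>%s</text>\n'
    '</spellrequest>\n'
)

def build_spell_request(word, lang="en"):
    """Hypothetical helper: return (path, body, headers) without sending anything."""
    # Escape &, < and > so the word cannot break the XML, then encode to octets.
    body = (TEMPLATE % escape(word)).encode("utf-8")
    path = "/tbproxy/spell?" + urlencode({"lang": lang})
    # The charset here must match the encoding actually used for the body.
    headers = {"Content-Type": "text/xml; charset=utf-8"}
    return path, body, headers

path, body, headers = build_spell_request("עברית & more", lang="he")
print(path)
print(headers)
print(body)
```

The three values could then be passed to `http.client.HTTPSConnection(host).request("POST", path, body, headers)`, mirroring the `httplib` call in the session.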
I noticed that when I pasted your example, it displays in the opposite order on my terminal from the way it displays in my browser. That is not too surprising, considering that Hebrew is a right-to-left language.
>>> corpus = u"עברית"
>>> print corpus[0]
ע
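The display order is a rendering matter; the string itself is stored in logical (reading) order either way. A small Python 3 sketch using the standard `unicodedata` module shows what the first stored character is and that it carries the right-to-left bidirectional class, which is what a bidi-aware terminal or browser uses to lay the text out.

```python
import unicodedata

corpus = "עברית"
first = corpus[0]   # index 0 is the first letter in reading order, not the leftmost glyph

print(unicodedata.name(first))            # HEBREW LETTER AYIN
print(unicodedata.bidirectional(first))   # R (strong right-to-left)
print([unicodedata.name(ch) for ch in corpus])
```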
SingleNegationElimination