Convert a unicode value containing UTF-8 bytes to str

Question

I use pyquery to parse a page:

 from pyquery import PyQuery
 dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
 content = dom('#mw-content-text > p').eq(0).text()

But what I get in content is a unicode string whose code points are really UTF-8 encoded bytes:

 u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'

How can I convert it to str without losing the content?

To make it clear:

I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

python unicode python-2.x utf-8 pyquery

wong2 Jan 26 '13 at 17:55

1 answer

Martijn Pieters · Accepted Answer · 2013-01-26T18:18:30+0000

If you have a unicode value whose code points are actually UTF-8 bytes, encode it to Latin-1 to preserve the "bytes":

 content = content.encode('latin1')

This works because Unicode code points U+0000 to U+00FF map one-to-one onto the Latin-1 encoding, so encoding to Latin-1 turns each code point in your data into the literal byte with the same value.
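The one-to-one mapping can be checked directly. This is only a minimal sketch, written so that it runs on both Python 2 and Python 3; the variable names are illustrative and not part of the question:

```python
# All 256 possible byte values as a byte string (works on Python 2 and 3).
raw = bytes(bytearray(range(256)))

# Decoding as Latin-1 yields code points U+0000..U+00FF, one per byte.
text = raw.decode('latin1')
assert [ord(ch) for ch in text] == list(range(256))

# Encoding back to Latin-1 restores exactly the same bytes.
assert text.encode('latin1') == raw

# Code points above U+00FF have no Latin-1 byte, so encoding them fails.
try:
    u'\u5c42'.encode('latin1')
    beyond_latin1_ok = True
except UnicodeEncodeError:
    beyond_latin1_ok = False
```

The last check is why the trick only applies to values that were mis-decoded byte by byte: a properly decoded Chinese string cannot be encoded to Latin-1 at all.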

For your example, this gives me:

 >>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
 >>> content.encode('latin1')
 '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
 >>> content.encode('latin1').decode('utf8')
 u'\u5c42\u53e0\u6837\u5f0f\u8868'
 >>> print content.encode('latin1').decode('utf8')
 层叠样式表
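If this has to be done in more than one place, the round trip could be wrapped in a small helper. The sketch below is an assumption about how one might package it, not something pyquery provides: `utf8_bytes_from_mojibake` and `repair_mojibake` are hypothetical names, and the fallback of returning the input unchanged is a design choice rather than the only option.

```python
def utf8_bytes_from_mojibake(text):
    """Return the byte string hidden in a mis-decoded unicode value.

    Hypothetical helper. Raises UnicodeEncodeError if `text` contains
    code points above U+00FF, i.e. it was not mis-decoded byte by byte.
    """
    return text.encode('latin1')


def repair_mojibake(text):
    """Return properly decoded text, or `text` unchanged if it cannot be repaired."""
    try:
        return utf8_bytes_from_mojibake(text).decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text was already decoded correctly, or the recovered
        # bytes are not valid UTF-8; leave it alone in both cases.
        return text


broken = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
fixed = repair_mojibake(broken)
```

Calling `repair_mojibake` on a value that is already correct returns it untouched, because the Latin-1 encode step fails for code points above U+00FF and the exception is caught.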
