Python UnicodeEncodeError: How can I just remove the troublesome Unicode characters?

Question

Here is what I did.

    >>> soup = BeautifulSoup(html)
    >>> soup
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
    >>>
    >>> soup.find('div')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
    >>>
    >>> soup.find('span')
    <span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>
    >>>

How can I just remove the troublesome Unicode characters from the HTML?
Or is there a cleaner solution?

+6

python parsing html-parsing unicode

Nullpoet Mar 08 '11 at 18:04

4 answers

The error you see occurs because repr(soup) tries to mix Unicode strings and bytestrings. In Python 2, mixing the two triggers an implicit conversion through the ASCII codec, which fails on any non-ASCII character.

For comparison:

    >>> u'1' + '©'
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

and

    >>> u'1' + u'©'
    u'1\xa9'
    >>> '1' + u'©'
    u'1\xa9'
    >>> '1' + '©'
    '1\xc2\xa9'

Here is an example with classes:

    >>> class A:
    ...     def __repr__(self):
    ...         return u'copyright ©'.encode('utf-8')
    ...
    >>> A()
    copyright ©
    >>> class B:
    ...     def __repr__(self):
    ...         return u'copyright ©'
    ...
    >>> B()
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
    >>> class C:
    ...     def __repr__(self):
    ...         return repr(A()) + repr(B())
    ...
    >>> C()
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "<input>", line 3, in __repr__
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)

A similar thing happens with BeautifulSoup:

    >>> html = """<p>©"""
    >>> soup = BeautifulSoup(html)
    >>> repr(soup)
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)

To get around this:

    >>> unicode(soup)
    u'<p>\xa9</p>'
    >>> str(soup)
    '<p>\xc2\xa9</p>'
    >>> soup.encode('utf-8')
    '<p>\xc2\xa9</p>'

+2

jfs Mar 09 '11 at 12:39

First of all, the "troublesome" Unicode characters may be letters in some language. But assuming you don't have to worry about non-English characters, you can use a Python library to convert Unicode to ANSI. Check the answer to this question: How to convert file format from Unicode to ASCII using Python?

The accepted answer there seems like a good solution (one I did not know about before).
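As a rough illustration of that kind of conversion, here is a minimal sketch using only the standard library. It is one common approach, not necessarily the one in the linked answer. The helper name to_ascii and the sample string are made up for this example, and the code targets Python 3 (the escaped literals also work on Python 2).

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only copy of a Unicode string (hypothetical helper)."""
    # NFKD splits accented letters into a base letter plus a combining mark,
    # so u'\xe9' (e with acute accent) becomes 'e' followed by U+0301.
    decomposed = unicodedata.normalize('NFKD', text)
    # 'ignore' silently drops everything that still has no ASCII equivalent,
    # such as the combining mark and the registered sign u'\xae'.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii(u'caf\xe9 amazon\xae.com'))  # cafe amazon.com
```

Note that this is lossy: characters with no ASCII counterpart disappear without any warning, so it only makes sense when that data really is disposable.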

+1

Karim Mar 08 '11 at 18:13

I had the same problem and spent hours on it. Note that the error occurs whenever the interpreter has to display the content, because it tries to convert it to ASCII, and that conversion fails. Take a look at the top answer here:

UnicodeEncodeError using BeautifulSoup 3.1.0.1 and Python 2.5.2
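If the failure only shows up when the content is displayed, one option is to do the ASCII conversion explicitly with an error handler instead of leaving it to the interpreter. The sketch below assumes that is acceptable for debugging output. The helper name display_safe is hypothetical, and the code is written for Python 3.

```python
def display_safe(text):
    """Return a copy of text that any ASCII-only console can show."""
    # 'backslashreplace' turns each non-ASCII character into an escape
    # sequence (for example \xae) instead of raising UnicodeEncodeError.
    return text.encode('ascii', 'backslashreplace').decode('ascii')


print(display_safe(u'amazon\xae.com'))  # amazon\xae.com
```

Unlike dropping the characters, this keeps the information about what was there, which is usually what you want when inspecting parsed HTML.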

0

kolba329 Jan 2 '12 at 22:21

esv · Accepted Answer · 2011-03-08T18:46:28+0000

Try it this way: soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
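To see what the 'ignore' error handler does on its own, here is a small sketch that leaves BeautifulSoup out so it stays self-contained. It assumes the HTML arrives as a byte string; the sample bytes and the helper name decode_dropping_bad_bytes are invented for illustration.

```python
def decode_dropping_bad_bytes(raw_html):
    """Decode bytes as UTF-8, discarding any bytes that are not valid UTF-8."""
    return raw_html.decode('utf-8', 'ignore')


# 0xae is the registered sign in Latin-1, but a lone 0xae byte is not valid
# UTF-8, so 'ignore' drops it instead of raising UnicodeDecodeError.
latin1_bytes = b'<span>amazon\xae.com</span>'
print(decode_dropping_bad_bytes(latin1_bytes))  # <span>amazon.com</span>

# Correctly encoded UTF-8 (0xc2 0xae) is kept, not removed.
utf8_bytes = b'<span>amazon\xc2\xae.com</span>'
print(decode_dropping_bad_bytes(utf8_bytes) == u'<span>amazon\xae.com</span>')  # True
```

So this approach only removes bytes that cannot be decoded. If the page is valid UTF-8, the non-ASCII characters survive as Unicode and still need an explicit encoding when they are printed.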
