Python UnicodeEncodeError: How can I just remove the troublesome Unicode characters?

Here is what I did.

    >>> soup = BeautifulSoup(html)
    >>> soup
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
    >>>
    >>> soup.find('div')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
    >>>
    >>> soup.find('span')
    <span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>

How can I just remove the troublesome Unicode characters from the HTML?
Or are there any cleaner solutions?

+6
4 answers

Try decoding the HTML before parsing it, ignoring any bytes that are not valid UTF-8:

    soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
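To illustrate what the 'ignore' error handler does, here is a minimal sketch that leaves BeautifulSoup out entirely. The sample bytes are made up for illustration. Note that 'ignore' only drops bytes that cannot be decoded as UTF-8; validly encoded non-ASCII characters are kept.

```python
# Hypothetical sample input: valid UTF-8 for "é" (\xc3\xa9) followed by a
# stray byte (\xff) that is not valid UTF-8.
raw = b"<p>caf\xc3\xa9 \xff&reg;</p>"

# 'ignore' silently drops the undecodable byte instead of raising
# UnicodeDecodeError; the valid "é" survives as a Unicode character.
text = raw.decode("utf-8", "ignore")

print(repr(text))
```

The result is a Unicode string, which is what you would then pass to the parser.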

+10

The error you see occurs because repr(soup) tries to mix Unicode strings and bytestrings. Mixing the two often leads to errors.

For comparison:

    >>> u'1' + '©'
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

and

    >>> u'1' + u'©'
    u'1\xa9'
    >>> '1' + u'©'
    u'1\xa9'
    >>> '1' + '©'
    '1\xc2\xa9'

Here is an example with classes:

    >>> class A:
    ...     def __repr__(self):
    ...         return u'copyright ©'.encode('utf-8')
    ...
    >>> A()
    copyright ©
    >>> class B:
    ...     def __repr__(self):
    ...         return u'copyright ©'
    ...
    >>> B()
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
    >>> class C:
    ...     def __repr__(self):
    ...         return repr(A()) + repr(B())
    ...
    >>> C()
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "<input>", line 3, in __repr__
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)

A similar thing happens with BeautifulSoup:

    >>> html = """<p>©"""
    >>> soup = BeautifulSoup(html)
    >>> repr(soup)
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)

To get around this:

    >>> unicode(soup)
    u'<p>\xa9</p>'
    >>> str(soup)
    '<p>\xc2\xa9</p>'
    >>> soup.encode('utf-8')
    '<p>\xc2\xa9</p>'

+2

First of all, the "troublesome" Unicode characters may be letters in some language. But if you don't have to worry about non-English characters, you can use a Python library to convert Unicode to ASCII. Check the answers to this question: How to convert file format from Unicode to ASCII using Python?

The accepted answer there seems like a good solution (one I did not know about beforehand).
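One common way to do such a conversion with the standard library is sketched below. It is not necessarily the exact code from the linked answer, and the helper name to_ascii is made up for illustration. It decomposes accented letters into a base letter plus combining marks, then drops everything that is still not ASCII.

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only approximation of a Unicode string.

    to_ascii is a hypothetical helper name, not a library function.
    """
    # NFKD splits characters such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with 'ignore' drops the accents and any other non-ASCII
    # characters (for example "®"), leaving only the ASCII base letters.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(u"caf\xe9"))   # accented letter keeps its base letter
print(to_ascii(u"Brand\xae"))  # the registered sign is removed entirely
```

Be aware that this is lossy: characters with no ASCII base letter simply disappear, so it only suits text where that is acceptable.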

+1

I had the same problem and spent hours on it. Note that the error occurs whenever the interpreter needs to display the content, because it tries to convert the text to ASCII, which fails for non-ASCII characters. Take a look at the top answer here:

UnicodeEncodeError using BeautifulSoup 3.1.0.1 and Python 2.5.2
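If the failure only happens when displaying the text, one option is to encode it explicitly with an error handler that cannot fail, rather than letting the interpreter pick ASCII with strict errors. The following is a rough sketch using the standard 'backslashreplace' and 'xmlcharrefreplace' handlers; the sample string is made up.

```python
# Hypothetical sample text containing one non-ASCII character (the
# copyright sign, U+00A9).
text = u"copyright \xa9"

# Replace non-ASCII characters with Python-style escapes; safe to print
# on an ASCII-only terminal.
escaped = text.encode("ascii", "backslashreplace").decode("ascii")

# Replace non-ASCII characters with numeric character references, which
# is often the more useful form when the text is HTML.
as_entity = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(escaped)
print(as_entity)
```

Unlike 'ignore', both handlers keep the information about which character was there.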

0
