The error occurs because repr(soup) tries to mix Unicode and bytestrings. In Python 2, mixing the two triggers an implicit conversion with the ASCII codec, which fails as soon as non-ASCII data is involved.
For comparison:
    >>> u'1' + '©'
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
and, where both operands have the same type or the bytestring is pure ASCII:
    >>> u'1' + u'©'
    u'1\xa9'
    >>> '1' + u'©'
    u'1\xa9'
    >>> '1' + '©'
    '1\xc2\xa9'
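One way to avoid the implicit ASCII conversion is to decode bytestrings explicitly before combining them with Unicode. The following is only a minimal sketch: the helper name `to_text` and the UTF-8 default are assumptions made for illustration, not something taken from the question.

```python
def to_text(value, encoding='utf-8'):
    """Return value as Unicode text, decoding bytestrings explicitly.

    Hypothetical helper; the encoding default is an assumption.
    """
    if isinstance(value, bytes):
        # Decode with a known encoding instead of the implicit ASCII codec.
        return value.decode(encoding)
    return value


copyright_bytes = u'\xa9'.encode('utf-8')  # the two bytes 0xc2 0xa9
combined = to_text(u'1') + to_text(copyright_bytes)
```

Because both operands are Unicode by the time they are concatenated, no implicit conversion takes place.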
Here is an example with classes:

    >>> class A:
    ...     def __repr__(self):
    ...         return u'copyright ©'.encode('utf-8')
    ...
    >>> A()
    copyright ©
    >>> class B:
    ...     def __repr__(self):
    ...         return u'copyright ©'
    ...
    >>> B()
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
    >>> class C:
    ...     def __repr__(self):
    ...         return repr(A()) + repr(B())
    ...
    >>> C()
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "<input>", line 3, in __repr__
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
A similar thing happens with BeautifulSoup:

    >>> html = """<p>©"""
    >>> soup = BeautifulSoup(html)
    >>> repr(soup)
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)
To get around this, convert explicitly instead of relying on repr():

    >>> unicode(soup)
    u'<p>\xa9</p>'
    >>> str(soup)
    '<p>\xc2\xa9</p>'
    >>> soup.encode('utf-8')
    '<p>\xc2\xa9</p>'
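For your own classes, one option is to keep `__repr__` ASCII-only by escaping non-ASCII characters, so that no implicit encoding can fail. This is a sketch under assumptions: the class name `Label` and its `text` attribute are hypothetical, and escaping with `backslashreplace` is just one possible choice.

```python
class Label(object):
    """Hypothetical class whose repr is safe to print on any terminal."""

    def __init__(self, text):
        self.text = text  # expected to be Unicode text

    def __repr__(self):
        # Replace non-ASCII characters with backslash escapes so the
        # result contains ASCII only.
        escaped = self.text.encode('ascii', 'backslashreplace').decode('ascii')
        return 'Label(%s)' % escaped


label = Label(u'copyright \xa9')
both = repr(label) + repr(label)  # concatenation no longer raises
```

The trade-off is that the copyright sign is shown as an escape sequence rather than as the character itself.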
jfs