How to convert Unicode text to plain text

Question

I am learning Beautiful Soup in Python.

I am trying to parse a simple web page with a list of books.

For example:

<a href="https://www.nostarch.com/carhacking">The Car Hacker's Handbook</a>

I am using the code below.

 import requests, bs4

 res = requests.get('http://nostarch.com')
 res.raise_for_status()
 nSoup = bs4.BeautifulSoup(res.text, "html.parser")
 elems = nSoup.select('.product-body a')
 # elems[0] gives <a href="https://www.nostarch.com/carhacking">The Car Hacker\u2019s Handbook</a>

and

 #elems[0].getText() gives u'The Car Hacker\u2019s Handbook'

But I want the text to come out correctly:

 s = elems[0].getText()
 print s
 # desired output: The Car Hacker's Handbook

How do I change my code so that it gives "The Car Hacker's Handbook" instead of u'The Car Hacker\u2019s Handbook'?

Please help.

+6

python unicode web-scraping ascii beautifulsoup

CS_noob Apr 14 '16 at 12:55


2 answers

You can try:

 import unicodedata

 def normText(unicodeText):
     return unicodedata.normalize('NFKD', unicodeText).encode('ascii', 'ignore')

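This converts the Unicode text to plain ASCII, which you can then write to a file. Note that characters with no ASCII decomposition, such as \u2019, are dropped rather than replaced.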
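To see what that does to the title from the question, here is a small Python 3 sketch (the answer above is Python 2, so the extra .decode('ascii') is my addition). The title is hard-coded rather than scraped, and PUNCTUATION_MAP and to_ascii are hypothetical names I made up for illustration. It maps a few typographic characters to ASCII stand-ins first, so the apostrophe survives instead of being dropped:

```python
import unicodedata

# Hypothetical mapping of common typographic punctuation to ASCII stand-ins.
PUNCTUATION_MAP = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
}


def to_ascii(text):
    """Replace known punctuation, then drop whatever is still non-ASCII."""
    replaced = text.translate(str.maketrans(PUNCTUATION_MAP))
    decomposed = unicodedata.normalize("NFKD", replaced)
    return decomposed.encode("ascii", "ignore").decode("ascii")


title = "The Car Hacker\u2019s Handbook"  # stand-in for elems[0].getText()

# Normalizing and ignoring errors, without the mapping step.
naive = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")

print(naive)            # The Car Hackers Handbook
print(to_ascii(title))  # The Car Hacker's Handbook
```

The mapping only covers the characters listed in it; anything else that has no ASCII decomposition is still silently removed.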

-2

Anil pediredla Apr 14 '16 at 14:29


mschuh · Accepted Answer · 2016-04-14T13:07:55+0000

Have you tried using the encode method?

 elems[0].getText().encode('utf-8')

More about Unicode and Python can be found at https://docs.python.org/2/howto/unicode.html
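As a rough illustration of what encoding does here, the following is a Python 3 sketch, not the asker's Python 2 session, with the title hard-coded instead of scraped. It assumes ascii() is a fair stand-in for the escaped form that Python 2 shows for a unicode string:

```python
# The same text the question gets back from getText(), hard-coded here.
title = "The Car Hacker\u2019s Handbook"

# ascii() gives an escaped representation, much like Python 2's u'...' display.
escaped = ascii(title)

# Encoding turns the text into UTF-8 bytes; U+2019 becomes three bytes.
encoded = title.encode("utf-8")

# Decoding those bytes gives back the original text unchanged.
decoded = encoded.decode("utf-8")

print(escaped)  # 'The Car Hacker\u2019s Handbook'
print(encoded)  # b'The Car Hacker\xe2\x80\x99s Handbook'
```

The \u2019 in the escaped form is only how the string is displayed; the text itself holds a single right-quote character.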

Alternatively, to find out whether your string is actually UTF-8 encoded, you can use chardet and run the following:

 >>> import chardet
 >>> chardet.detect(elems[0].getText())
 {'confidence': 0.5, 'encoding': 'utf-8'}
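chardet only gives a statistical guess. If all you want to know is whether some raw bytes decode cleanly as UTF-8, a standard-library check may be enough. This is a Python 3 sketch; is_valid_utf8 is a hypothetical helper name, and both byte strings are made-up samples rather than data from the actual page:

```python
def is_valid_utf8(raw):
    """Return True if the bytes decode as UTF-8 without errors."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The title encoded as UTF-8.
good = "The Car Hacker\u2019s Handbook".encode("utf-8")

# The same title as Windows-1252 bytes, where the quote is the single byte 0x92.
bad = b"The Car Hacker\x92s Handbook"

print(is_valid_utf8(good))  # True
print(is_valid_utf8(bad))   # False
```

A successful decode does not prove the bytes were meant as UTF-8, only that they are valid UTF-8.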
