How to convert Unicode text to plain text

I am learning Beautiful Soup in Python.

I am trying to parse a simple web page with a list of books.

For example:

<a href="https://www.nostarch.com/carhacking">The Car Hacker's Handbook</a> 

I am using the code below.

    import requests, bs4

    res = requests.get('http://nostarch.com')
    res.raise_for_status()
    nSoup = bs4.BeautifulSoup(res.text, "html.parser")
    elems = nSoup.select('.product-body a')
    # elems[0] gives <a href="https://www.nostarch.com/carhacking">The Car Hacker\u2019s Handbook</a>

and

    # elems[0].getText() gives u'The Car Hacker\u2019s Handbook'

But I want the text to print correctly:

    s = elems[0].getText()
    print s
    # desired output: The Car Hacker's Handbook

How do I change my code to give "The Car Hacker's Handbook" instead of u'The Car Hacker\u2019s Handbook'?

Please help.

+6
2 answers

Have you tried using the encode method?

 elems[0].getText().encode('utf-8') 

More about Unicode in Python can be found at https://docs.python.org/2/howto/unicode.html
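
To illustrate what encode does here, a minimal sketch follows. The hard-coded title is an assumption standing in for elems[0].getText(), so the example runs without fetching the page. The u'...\u2019...' form is only how Python 2 displays the repr of a Unicode string; the curly apostrophe itself is intact, and encoding to UTF-8 turns that one character into three bytes.

```python
from __future__ import print_function

# Stand-in for elems[0].getText() (assumption: same value as in the question).
title = u"The Car Hacker\u2019s Handbook"

# Encoding converts the Unicode string into a UTF-8 byte string.
encoded = title.encode("utf-8")

# The repr shows escapes, but the underlying text is unchanged.
print(repr(title))
print(repr(encoded))

# One code point (U+2019) becomes three bytes (E2 80 99) in UTF-8.
print(len(title), len(encoded))

# Decoding the bytes gives back the original Unicode string.
round_trip = encoded.decode("utf-8")
```

Printing the encoded value on a UTF-8 terminal under Python 2 should show the title with its curly apostrophe rather than the escaped form.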

Alternatively, to find out if your string is actually utf-8 encoded, you can use chardet and run the following command:

    >>> import chardet
    >>> chardet.detect(elems[0].getText())
    {'confidence': 0.5, 'encoding': 'utf-8'}
+3

You can try:

    import unicodedata

    def normText(unicodeText):
        return unicodedata.normalize('NFKD', unicodeText).encode('ascii', 'ignore')

This converts the Unicode text to plain ASCII text, which you can then write to a file.
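
One caveat with this approach: U+2019 (the curly apostrophe) has no ASCII decomposition under NFKD, so encode('ascii', 'ignore') silently drops it and the title comes out as "The Car Hackers Handbook". A possible variation, sketched below, replaces common typographic quotes with ASCII equivalents before normalizing. The names PUNCTUATION_MAP and to_ascii are hypothetical, and the mapping covers only four quote characters as an illustration.

```python
import unicodedata

# Hypothetical mapping from typographic quotes to ASCII equivalents.
PUNCTUATION_MAP = {
    u"\u2018": u"'",   # left single quotation mark
    u"\u2019": u"'",   # right single quotation mark / apostrophe
    u"\u201c": u'"',   # left double quotation mark
    u"\u201d": u'"',   # right double quotation mark
}


def to_ascii(text):
    """Return an ASCII-only version of a Unicode string."""
    # Swap quotes first, since 'ignore' would otherwise discard them.
    for fancy, plain in PUNCTUATION_MAP.items():
        text = text.replace(fancy, plain)
    # Decompose accented letters, then drop anything still non-ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# What the plain normalize-and-ignore approach produces for comparison.
dropped = unicodedata.normalize(
    "NFKD", u"The Car Hacker\u2019s Handbook"
).encode("ascii", "ignore")

plain_title = to_ascii(u"The Car Hacker\u2019s Handbook")
```

Here plain_title keeps an ASCII apostrophe, while dropped shows the character being lost entirely.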

-2
