BeautifulSoup's .text returns text without line separators (\n, \r, etc.)

I am trying to parse song lyrics from http://amalgama-lab.com, the largest Russian-language lyrics website, and save them (both the original and the translation) to the audio list of my Vkontakte account. Unfortunately, amalgama does not have an API.

    import urllib
    from BeautifulSoup import BeautifulSoup
    import vkontakte

    vk = vkontakte.API(token=<SECRET_TOKEN>)
    audios = vk.getAudios(count='2')
    # Example item:
    # {u'artist': u'The Beatles', u'url': u'http://cs4519.vkontakte.ru/u4665445/audio/4241af71a888.mp3', u'title': u'Yesterday', u'lyrics_id': u'2365986', u'duration': 130, u'aid': 166194990, u'owner_id': 173505924}
    url = 'http://amalgama.mobi/songs/'
    for i in audios:
        print i['artist']
        if i['artist'].startswith('The '):
            url += i['artist'][4:5] + '/' + i['artist'][4:].replace(' ', '_') + '/' + i['title'].replace(' ', '_') + '.html'
        else:
            url += i['artist'][:1] + '/' + i['artist'].replace(' ', '_') + '/' + i['title'].replace(' ', '_') + '.html'
        url = url.lower()
        page = urllib.urlopen(url)
        soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
        texts = soup.findAll('ol', )
        if len(texts) != 0:
            en = texts[0].text  # this!
            ru = texts[1].text  # this!
            vk.get('audio.edit', aid=i['aid'], oid=i['owner_id'], artist=i['artist'], title=i['title'], text=ru, no_search=0)

but .text returns a string without separators:

“Yesterday all my troubles seemed so far away. Now everything looks like they are here to stay. I believe in yesterday. Suddenly I’m not half the person I once was. There a shadow hanging over me suddenly appeared yesterday [Chorus:] Why she had to leave, I don’t know, she wouldn’t say that I said something wrong, now I want it yesterday, love was such an easy game to play, now I need a place to hide. Oh, I believe in”

That is the main problem. Also, what is the best way to save the lyrics in this layout:

Lyrics 1 (original)

Lyrics 1 (translated)

Lyrics 2 (original)

Lyrics 2 (translated)

Lyrics 3 (original)

Lyrics 3 (translated)

...

Everything I have come up with so far is messy code. Thanks.

3 answers

Try the separator parameter of the get_text method:

    from bs4 import BeautifulSoup

    html = '''<p> Hi. This is a simple example.<br>Yet powerful one. <p>'''
    soup = BeautifulSoup(html, 'html.parser')

    soup.get_text()
    # Output: u' Hi. This is a simple example.Yet powerful one. '

    soup.get_text(separator=' ')
    # Output: u' Hi. This is a simple example. Yet powerful one. '
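For the lyrics case in the question, the same parameter can take a newline instead of a space. The snippet below is only a sketch under a few assumptions: bs4 is installed (the question uses the older BeautifulSoup 3 import), and the inline markup is a made-up stand-in for one <ol> block, not the real amalgama page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one <ol> block of a lyrics page;
# the real page structure is an assumption here.
html = "<ol><p>First line<br/>Second line</p><p>Third line</p></ol>"

soup = BeautifulSoup(html, "html.parser")
ol = soup.find("ol")

# Join the text nodes with a newline and strip whitespace around each one.
lyrics = ol.get_text(separator="\n", strip=True)
print(lyrics)
```

Each text node that was separated by a <br/> or a paragraph boundary ends up on its own line, which is usually what is wanted before sending the text on to another API.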

You can do it like this:

    soup = BeautifulSoup(html)
    ols = soup.findAll('ol')  # one <ol> per language
    for ol in ols:
        ps = ol.findAll('p')
        for p in ps:
            # print every child of the paragraph except the <br /> tags
            for item in p.contents:
                if str(item) != '<br />':
                    print str(item)
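As for the second part of the question (original and translated lyrics interleaved), one possible approach is to pair the lines of the two texts once they have been extracted with their line breaks intact. This is a rough sketch in Python 3 using only the standard library; the helper name interleave, the sample strings, and the choice of a blank line between pairs are all assumptions made for illustration.

```python
from itertools import zip_longest


def interleave(original, translated):
    """Return original/translated line pairs separated by blank lines."""
    pairs = zip_longest(
        original.splitlines(),
        translated.splitlines(),
        fillvalue="",  # pad the shorter text so no line is dropped
    )
    blocks = ["{}\n{}".format(o, t) for o, t in pairs]
    return "\n\n".join(blocks)


# Placeholder strings, not real lyrics.
en = "Original one\nOriginal two"
ru = "Translated one\nTranslated two"

combined = interleave(en, ru)
print(combined)
```

zip_longest is used instead of zip so that a text with extra lines is not silently truncated when the two versions differ in length.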