Hope this is still relevant! I'm doing the same thing with Eminem lyrics, but from lyrics.com. Does it have to be from Rap Genius? I found lyrics.com easier to scrape.
To get Andre 3000, just change the code accordingly.
Here is my code. It gets the links to the songs, then crawls those pages to get the lyrics and appends them to a list:

    import re
    import requests
    import nltk
    from bs4 import BeautifulSoup

    url = 'http://www.lyrics.com/eminem'
    r = requests.get(url)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(r.content, 'html.parser')
    gdata = soup.find_all('div', {'class': 'row'})

    eminemLyrics = []

    for item in gdata:
        title = item.find_all('a', {'itemprop': 'name'})[0].text
        lyricsdotcom = 'http://www.lyrics.com'
        for link in item('a'):
            try:
                # Build the song URL and fetch the lyrics page
                lyriclink = lyricsdotcom + link.get('href')
                req = requests.get(lyriclink)
                lyricsoup = BeautifulSoup(req.content, 'html.parser')
                # The lyrics sit in a div whose id matches one of these names
                lyricdata = lyricsoup.find_all('div', {'id': re.compile('lyric_space|lyrics')})[0].text
                eminemLyrics.append([title, lyricdata])
                print(title)
                print(lyricdata)
                print('')  # blank line between songs
            except Exception:
                # Skip links that are not song pages or that fail to load
                pass

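One fragile spot in the loop above is the string concatenation used to build each song URL. As a minimal sketch, assuming Python 3, the standard library's `urljoin` handles relative and absolute links alike. `song_url` is a hypothetical helper name, not part of the original code, and the hrefs below are made up for illustration:

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.lyrics.com'  # same base as in the code above


def song_url(href, base=BASE_URL):
    """Return an absolute URL for href, or None if the tag had no href."""
    if not href:
        return None
    return urljoin(base, href)


# Made-up hrefs, for illustration only
print(song_url('/lyric/123/example-song'))
print(song_url('http://www.lyrics.com/lyric/456/other'))
print(song_url(None))
```
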
This gives you the lyrics in a list. To print all the titles:

    titles = [i[0] for i in eminemLyrics]
    print(titles)

To get a specific song:

    >>> titles.index('Cleaning out My Closet')
    120

To tokenize the song, plug that index (120) into:

    song = nltk.word_tokenize(eminemLyrics[120][1])
    nltk.pos_tag(song)

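If you would rather not hard-code the index, a small helper can look the song up by title. This is only a sketch: `find_lyrics` is a hypothetical name, and the sample list stands in for the `eminemLyrics` list built earlier (the lyric strings are placeholders, not real scraped data):

```python
def find_lyrics(songs, title):
    """Return the lyrics of the first [title, lyrics] pair whose title
    matches, ignoring case and surrounding whitespace; None if absent."""
    wanted = title.strip().lower()
    for song_title, lyrics in songs:
        if song_title.strip().lower() == wanted:
            return lyrics
    return None


# Placeholder data in the same [title, lyrics] shape as eminemLyrics
sample = [
    ['Stan', 'placeholder lyrics one'],
    ['Cleaning out My Closet', 'placeholder lyrics two'],
]

print(find_lyrics(sample, 'cleaning out my closet'))
print(find_lyrics(sample, 'Not A Song'))
```

The returned string can then be passed to `nltk.word_tokenize` in place of `eminemLyrics[120][1]`.
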
tmthyjames