Scraping Rap Genius lyrics with Python

I'm a little new to coding, and I'm trying to scrape Andre 3000's song lyrics from Rap Genius, http://genius.com/artists/Andre-3000 , using Beautiful Soup (a Python library for pulling data out of HTML and XML files). My ultimate goal is to have the lyrics as strings. Here is what I have so far:

    from bs4 import BeautifulSoup
    from urllib2 import urlopen

    BASE_URL = "http://rapgenius.com"
    artist_url = "http://rapgenius.com/artists/Andre-3000"

    def get_song_links(url):
        html = urlopen(url).read()
        # print html
        soup = BeautifulSoup(html, "lxml")
        container = soup.find("div", "container")
        song_links = [BASE_URL + dd.a["href"] for dd in container.findAll("dd")]
        print song_links
        # Debug: print every link found on the page
        for link in soup.find_all('a'):
            print link.get('href')

    get_song_links(artist_url)

So I need help with the rest of the code. How do I get the lyrics as a string, and then use the Natural Language Toolkit (nltk) to tokenize the sentences and words?

+7
python html-parsing web-scraping nltk beautifulsoup
4 answers

Here is an example of how to collect all the song links on the page, follow them, and get the lyrics:

    from urlparse import urljoin
    from bs4 import BeautifulSoup
    import requests

    BASE_URL = "http://genius.com"
    artist_url = "http://genius.com/artists/Andre-3000/"

    response = requests.get(artist_url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
    soup = BeautifulSoup(response.text, "lxml")

    for song_link in soup.select('ul.song_list > li > a'):
        link = urljoin(BASE_URL, song_link['href'])
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "lxml")
        lyrics = soup.find('div', class_='lyrics').text.strip()
        # tokenize `lyrics` with nltk

Note that this uses the requests library rather than urllib2. Also note that the User-Agent header is required: without it the site returns 403 Forbidden.
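Once you have the `lyrics` string, you will usually want to strip section markers like `[Verse 1]` before tokenizing. A minimal sketch using only the standard library; the sample string and helper name are made up for illustration:

```python
import re

def clean_lyrics(raw):
    """Strip section markers like [Verse 1] and collapse blank lines
    before handing the text to nltk."""
    text = re.sub(r"\[.*?\]", "", raw)  # drop [Intro], [Chorus], ...
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return "\n".join(lines)

sample = "[Verse 1]\nHey ya\n\n[Chorus]\nShake it"
print(clean_lyrics(sample))  # -> "Hey ya\nShake it"
```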

+3

First, for each link you will need to download that page and parse it with BeautifulSoup. Then look for a distinguishing attribute on the page that separates the lyrics from the rest of the content. I found data-editorial-state="accepted" data-property="accepted" data-group="0" to be helpful. Then call .find_all on the parsed page to get all the lyric lines, and call .get_text() on each line to pull out its text.
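As a minimal sketch of that find_all / get_text approach (the HTML fragment here is made up to mimic the markup described above, not fetched from the real site):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the lyric markup described above
html = '''
<div class="lyrics">
  <a data-editorial-state="accepted" data-group="0">Hey ya</a>
  <a data-editorial-state="accepted" data-group="0">Shake it like a Polaroid picture</a>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
# Match on the distinguishing attribute, then pull the text of each line
lines = [a.get_text() for a in soup.find_all("a", attrs={"data-editorial-state": "accepted"})]
lyric_text = "\n".join(lines)
print(lyric_text)
```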

As for NLTK, after installing it you can import it and split the lyrics into sentences and words like this:

    from nltk.tokenize import word_tokenize, sent_tokenize

    words = [word_tokenize(t) for t in sent_tokenize(lyric_text)]

This will give you a list of all the words in each sentence.

+1

GitHub / jashanj0tsingh / LyricsScraper.py does basic scraping of genius.com lyrics into a text file, one song per line. The artist's name is taken as input. The generated text file can then easily be fed into your custom nltk or general-purpose parser to do whatever you need.

Code below:

    # A simple script to scrape lyrics from genius.com based on artist name.
    import re
    import requests
    import time
    import codecs
    from bs4 import BeautifulSoup
    from selenium import webdriver

    # Browser to automate; point this at your ChromeDriver binary
    mybrowser = webdriver.Chrome("path\\to\\chromedriver\\binary")
    user_input = input("Enter Artist Name = ").replace(" ", "+")
    base_url = "https://genius.com/search?q=" + user_input  # append the artist name to the search query
    mybrowser.get(base_url)

    # Scroll for a fixed amount of time to reach the bottom of the page.
    # TODO: a better condition to detect the end of the page.
    t_sec = time.time() + 60 * 20  # seconds * minutes
    while time.time() < t_sec:
        mybrowser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        html = mybrowser.page_source
        soup = BeautifulSoup(html, "html.parser")
        time.sleep(5)

    pattern = re.compile("[\S]+-lyrics$")  # keep only links that end with "-lyrics"
    pattern2 = re.compile("\[(.*?)\]")     # strip section markers such as [Intro], [Chorus], etc.

    with codecs.open('lyrics.txt', 'a', 'utf-8-sig') as myfile:
        for link in soup.find_all('a', href=True):
            if pattern.match(link['href']):
                f = requests.get(link['href'])
                lyricsoup = BeautifulSoup(f.content, "html.parser")
                # lyrics = lyricsoup.find("lyrics").get_text().replace("\n", "")  # each song on one line
                lyrics = lyricsoup.find("lyrics").get_text()  # line by line
                lyrics = re.sub(pattern2, "", lyrics)
                myfile.write(lyrics + "\n")

    mybrowser.close()
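Once lyrics.txt exists, loading it back for nltk is straightforward. A minimal sketch, assuming the one-song-per-line format and utf-8-sig encoding used above; the demo filename and helper name are made up:

```python
import codecs

def load_songs(path):
    """Load the scraper's output, one song per line, skipping blank lines."""
    with codecs.open(path, 'r', 'utf-8-sig') as f:
        return [line.strip() for line in f if line.strip()]

# Demo with a throwaway file standing in for the real lyrics.txt
with codecs.open('demo_lyrics.txt', 'w', 'utf-8-sig') as f:
    f.write(u"first song lyrics\n\nsecond song lyrics\n")

songs = load_songs('demo_lyrics.txt')
print(songs)  # -> ['first song lyrics', 'second song lyrics']
```

Each element of `songs` can then be passed straight to nltk's tokenizers.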
+1

Hope this is still relevant! I did the same thing with Eminem lyrics, but from lyrics.com. Does it have to be Rap Genius? I found lyrics.com easier to scrape.

To get Andre 3000, just change the code accordingly.

Here is my code; it grabs the song links, then crawls those pages to pull the lyrics and appends them to a list:

    import re
    import requests
    import nltk
    from bs4 import BeautifulSoup

    url = 'http://www.lyrics.com/eminem'
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    gdata = soup.find_all('div', {'class': 'row'})
    eminemLyrics = []

    for item in gdata:
        title = item.find_all('a', {'itemprop': 'name'})[0].text
        lyricsdotcom = 'http://www.lyrics.com'
        for link in item('a'):
            try:
                lyriclink = lyricsdotcom + link.get('href')
                req = requests.get(lyriclink)
                lyricsoup = BeautifulSoup(req.content)
                lyricdata = lyricsoup.find_all('div', {'id': re.compile('lyric_space|lyrics')})[0].text
                eminemLyrics.append([title, lyricdata])
                print title
                print lyricdata
                print
            except:
                pass

This gives you the lyrics in a list. To print all the titles:

    titles = [i[0] for i in eminemLyrics]
    print titles

To get a specific song:

    titles.index('Cleaning out My Closet')   # returns 120

To tokenize a song, plug that index (120) into:

    song = nltk.word_tokenize(eminemLyrics[120][1])
    nltk.pos_tag(song)
0
