Scraping Rap Genius lyrics with Python

I'm a little new to coding, and I'm trying to scrape Andre 3000's song lyrics from Rap Genius, http://genius.com/artists/Andre-3000 , using Beautiful Soup (a Python library for pulling data out of HTML and XML files). My ultimate goal is to have the lyrics as strings. Here is what I have so far:

    from bs4 import BeautifulSoup
    from urllib2 import urlopen

    BASE_URL = "http://rapgenius.com"
    artist_url = "http://rapgenius.com/artists/Andre-3000"

    def get_song_links(url):
        html = urlopen(url).read()
        # print html
        soup = BeautifulSoup(html, "lxml")
        container = soup.find("div", "container")
        song_links = [BASE_URL + dd.a["href"] for dd in container.findAll("dd")]
        print song_links
        # Debug: print every link found on the page
        for link in soup.find_all('a'):
            print link.get('href')

    get_song_links(artist_url)

So I need help with the rest of the code. How do I get the lyrics as a string, and then use the Natural Language Toolkit (nltk) to tokenize the sentences and words?

+7
python html-parsing web-scraping nltk beautifulsoup
4 answers

Here is an example of how to collect all the song links on the page, follow them, and get the lyrics:

    from urlparse import urljoin
    from bs4 import BeautifulSoup
    import requests

    BASE_URL = "http://genius.com"
    artist_url = "http://genius.com/artists/Andre-3000/"

    response = requests.get(artist_url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
    soup = BeautifulSoup(response.text, "lxml")

    for song_link in soup.select('ul.song_list > li > a'):
        link = urljoin(BASE_URL, song_link['href'])
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "lxml")
        lyrics = soup.find('div', class_='lyrics').text.strip()
        # tokenize `lyrics` with nltk

Note that this uses the requests library rather than urllib2. Also note that the User-Agent header is required: without it the site returns 403 Forbidden.
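Once you have the `lyrics` string, you will usually want to strip section markers like `[Verse 1]` before tokenizing. A minimal sketch using only the standard library; the sample string and helper name are made up for illustration:

```python
import re

def clean_lyrics(raw):
    """Strip section markers like [Verse 1] and collapse blank lines
    before handing the text to nltk."""
    text = re.sub(r"\[.*?\]", "", raw)  # drop [Intro], [Chorus], ...
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return "\n".join(lines)

sample = "[Verse 1]\nHey ya\n\n[Chorus]\nShake it"
print(clean_lyrics(sample))  # -> "Hey ya\nShake it"
```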

+3

First, for each link you will need to download that page and parse it with BeautifulSoup. Then look for a distinguishing attribute on the page that separates the lyrics from the rest of the content. I found data-editorial-state="accepted" data-property="accepted" data-group="0" to be helpful. Then call .find_all on the parsed page to get all the lyric lines, and call .get_text() on each line to pull out its text.
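As a minimal sketch of that find_all / get_text approach (the HTML fragment here is made up to mimic the markup described above, not fetched from the real site):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the lyric markup described above
html = '''
<div class="lyrics">
  <a data-editorial-state="accepted" data-group="0">Hey ya</a>
  <a data-editorial-state="accepted" data-group="0">Shake it like a Polaroid picture</a>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
# Match on the distinguishing attribute, then pull the text of each line
lines = [a.get_text() for a in soup.find_all("a", attrs={"data-editorial-state": "accepted"})]
lyric_text = "\n".join(lines)
print(lyric_text)
```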

As for NLTK, after installing it you can import it and split the lyrics into sentences and words like this:

    from nltk.tokenize import word_tokenize, sent_tokenize

    words = [word_tokenize(t) for t in sent_tokenize(lyric_text)]

This will give you a list of all the words in each sentence.

+1

GitHub / jashanj0tsingh / LyricsScraper.py does basic scraping of genius.com lyrics into a text file, one song per line. The artist's name is taken as input. The generated text file can then easily be fed into your custom nltk or general-purpose parser to do whatever you need.

Code below:

    # A simple script to scrape lyrics from genius.com based on artist name.
    import re
    import requests
    import time
    import codecs
    from bs4 import BeautifulSoup
    from selenium import webdriver

    # Browser to automate; point this at your ChromeDriver binary
    mybrowser = webdriver.Chrome("path\\to\\chromedriver\\binary")
    user_input = input("Enter Artist Name = ").replace(" ", "+")
    base_url = "https://genius.com/search?q=" + user_input  # append the artist name to the search query
    mybrowser.get(base_url)

    # Scroll for a fixed amount of time to reach the bottom of the page.
    # TODO: a better condition to detect the end of the page.
    t_sec = time.time() + 60 * 20  # seconds * minutes
    while time.time() < t_sec:
        mybrowser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        html = mybrowser.page_source
        soup = BeautifulSoup(html, "html.parser")
        time.sleep(5)

    pattern = re.compile("[\S]+-lyrics$")  # keep only links that end with "-lyrics"
    pattern2 = re.compile("\[(.*?)\]")     # strip section markers such as [Intro], [Chorus], etc.

    with codecs.open('lyrics.txt', 'a', 'utf-8-sig') as myfile:
        for link in soup.find_all('a', href=True):
            if pattern.match(link['href']):
                f = requests.get(link['href'])
                lyricsoup = BeautifulSoup(f.content, "html.parser")
                # lyrics = lyricsoup.find("lyrics").get_text().replace("\n", "")  # each song on one line
                lyrics = lyricsoup.find("lyrics").get_text()  # line by line
                lyrics = re.sub(pattern2, "", lyrics)
                myfile.write(lyrics + "\n")

    mybrowser.close()
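Once lyrics.txt exists, loading it back for nltk is straightforward. A minimal sketch, assuming the one-song-per-line format and utf-8-sig encoding used above; the demo filename and helper name are made up:

```python
import codecs

def load_songs(path):
    """Load the scraper's output, one song per line, skipping blank lines."""
    with codecs.open(path, 'r', 'utf-8-sig') as f:
        return [line.strip() for line in f if line.strip()]

# Demo with a throwaway file standing in for the real lyrics.txt
with codecs.open('demo_lyrics.txt', 'w', 'utf-8-sig') as f:
    f.write(u"first song lyrics\n\nsecond song lyrics\n")

songs = load_songs('demo_lyrics.txt')
print(songs)  # -> ['first song lyrics', 'second song lyrics']
```

Each element of `songs` can then be passed straight to nltk's tokenizers.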
+1

Hope this is still relevant! I did the same thing with Eminem lyrics, but from lyrics.com. Does it have to be Rap Genius? I found lyrics.com easier to scrape.

To get Andre 3000, just change the code accordingly.

Here is my code; it grabs the song links, then crawls those pages to pull the lyrics and appends them to a list:

    import re
    import requests
    import nltk
    from bs4 import BeautifulSoup

    url = 'http://www.lyrics.com/eminem'
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    gdata = soup.find_all('div', {'class': 'row'})
    eminemLyrics = []

    for item in gdata:
        title = item.find_all('a', {'itemprop': 'name'})[0].text
        lyricsdotcom = 'http://www.lyrics.com'
        for link in item('a'):
            try:
                lyriclink = lyricsdotcom + link.get('href')
                req = requests.get(lyriclink)
                lyricsoup = BeautifulSoup(req.content)
                lyricdata = lyricsoup.find_all('div', {'id': re.compile('lyric_space|lyrics')})[0].text
                eminemLyrics.append([title, lyricdata])
                print title
                print lyricdata
                print
            except:
                pass

This gives you the lyrics in a list. To print all the titles:

    titles = [i[0] for i in eminemLyrics]
    print titles

To get a specific song:

    titles.index('Cleaning out My Closet')   # returns 120

To tokenize a song, plug that index (120) into:

    song = nltk.word_tokenize(eminemLyrics[120][1])
    nltk.pos_tag(song)
0
