Real Estate Web Scraper with R

As an intern in an economic research group, I was given the task of finding a way to automatically collect specific data from a real estate classifieds website using R.

I assume the relevant packages are XML and RCurl, but my understanding of how they work is very limited.

Here's the main page of the website: http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/?f=a&th=1&zz=59000 Ideally, I would like to build my database so that each row corresponds to an ad.

Here is an example of an ad detail page: http://www.leboncoin.fr/ventes_immobilieres/197284216.htm?ca=17_s My variables are: price ("Prix"), city ("Ville"), surface area ("Surface"), "GES", "Classe énergie", and the number of rooms ("Pièces"), as well as the number of images shown in the ad. I would also like to export the ad text to a character vector, on which I will later perform text mining.

I would appreciate any help, or a link to a tutorial or how-to, that would point me in the right direction.

2 answers

You can use the XML package in R to scrape this data. Here is a piece of code that should help.

    # DEFINE UTILITY FUNCTIONS

    # Function to get links to ads on a given results page
    get_ad_links = function(page){
      require(XML)
      # construct the url for the page
      url_base = "http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/"
      url = paste(url_base, "?o=", page, "&zz=", 59000, sep = "")
      doc = htmlTreeParse(url, useInternalNodes = T)
      # extract links to ads on the page
      xp_exp   = "//td/a[contains(@href, 'ventes_immobilieres')]"
      ad_links = xpathSApply(doc, xp_exp, xmlGetAttr, "href")
      return(ad_links)
    }

    # Function to get ad details for a given ad url
    get_ad_details = function(ad_url){
      require(XML)
      # parse the ad page into an html tree
      doc = htmlTreeParse(ad_url, useInternalNodes = T)
      # extract labels and values using xpath expressions
      labels  = xpathSApply(doc, "//span[contains(@class, 'ad')]/label", xmlValue)
      values1 = xpathSApply(doc, "//span[contains(@class, 'ad')]/strong", xmlValue)
      values2 = xpathSApply(doc, "//span[contains(@class, 'ad')]//a", xmlValue)
      values  = c(values1, values2)
      # convert to a one-row data frame with labels as column names
      mydf = as.data.frame(t(values))
      names(mydf) = labels
      return(mydf)
    }

Here's how you could use these functions to extract information into a data frame.

    # grab ad links from page 1
    ad_links = get_ad_links(page = 1)

    # grab ad details for the first 5 links from page 1
    require(plyr)
    ad_details = ldply(ad_links[1:5], get_ad_details, .progress = 'text')

This returns the following output:

           Prix :      Ville : Frais d'agence inclus : Type de bien : Pièces : Surface :  Classe énergie :          GES :
    1  469 000 € 59000 Lille                      Oui         Maison       8    250 m2  F (de 331 à 450)           <NA>
    2  469 000 € 59000 Lille                      Oui         Maison       8    250 m2  F (de 331 à 450)           <NA>
    3  140 000 € 59000 Lille                     <NA>    Appartement       2     50 m2  D (de 151 à 230) E (de 36 à 55)
    4  140 000 € 59000 Lille                     <NA>    Appartement       2     50 m2  D (de 151 à 230) E (de 36 à 55)
    5  170 000 € 59000 Lille                     <NA>    Appartement    <NA>     50 m2  D (de 151 à 230) D (de 21 à 35)

You can easily use the apply family of functions to loop through multiple pages and gather details for all the ads. Two things to keep in mind: one is the legality of scraping this website; the other is to use Sys.sleep in your looping function so that the servers are not bombarded with requests.
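
For example, a paged loop might look like the sketch below. It reuses the two functions above; the number of pages (3) and the two-second pause are arbitrary values you would tune yourself.

    # grab ad links from the first 3 pages, pausing between requests
    all_links = c()
    for (p in 1:3) {
      all_links = c(all_links, get_ad_links(page = p))
      Sys.sleep(2)  # wait 2 seconds between page requests
    }

    # grab details for every collected ad, again pausing between requests
    require(plyr)
    all_details = ldply(all_links, function(u) {
      Sys.sleep(2)  # wait 2 seconds between ad requests
      get_ad_details(u)
    }, .progress = 'text')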

Let me know how it works.


This is a pretty big question, so you need to break it down into smaller ones and see which bits you get stuck on.

Is the problem getting the web page? (Watch out for proxy server issues.) Or is it the trickier part of extracting the useful bits of data from it? (You will probably need XPath for this.)
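
To make those two bits concrete, here is a minimal sketch using the RCurl and XML packages mentioned in the question. The URL is the ad page from the question, and the XPath expression is borrowed from the other answer purely as an illustration, so it may need adjusting.

    library(RCurl)
    library(XML)

    # bit 1: fetch the web page (getURL also accepts curl options,
    # e.g. for a proxy, if your network requires one)
    html = getURL("http://www.leboncoin.fr/ventes_immobilieres/197284216.htm?ca=17_s")

    # bit 2: parse it and extract useful pieces of data with XPath
    doc    = htmlParse(html, asText = TRUE)
    fields = xpathSApply(doc, "//span[contains(@class, 'ad')]/strong", xmlValue)

Whichever bit you get stuck on is the one to ask a focused question about.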

Take a look at the Rosetta Code web scraping example and check out these SO questions for more information.

