You can use the XML package in R to scrape this data. Here is a piece of code that should help.
    # DEFINE UTILITY FUNCTIONS

    # Function to Get Links to Ads by Page
    get_ad_links = function(page){
      require(XML)
      # construct url to page
      url_base = "http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/"
      url      = paste(url_base, "?o=", page, "&zz=", 59000, sep = "")
      page     = htmlTreeParse(url, useInternalNodes = TRUE)

      # extract links to ads on page
      xp_exp   = "//td/a[contains(@href, 'ventes_immobilieres')]"
      ad_links = xpathSApply(page, xp_exp, xmlGetAttr, "href")
      return(ad_links)
    }

    # Function to Get Ad Details by Ad URL
    get_ad_details = function(ad_url){
      require(XML)
      # parse ad url to html tree
      doc = htmlTreeParse(ad_url, useInternalNodes = TRUE)

      # extract labels and values using xpath expression
      labels  = xpathSApply(doc, "//span[contains(@class, 'ad')]/label", xmlValue)
      values1 = xpathSApply(doc, "//span[contains(@class, 'ad')]/strong", xmlValue)
      values2 = xpathSApply(doc, "//span[contains(@class, 'ad')]//a", xmlValue)
      values  = c(values1, values2)

      # convert to data frame and add labels
      mydf        = as.data.frame(t(values))
      names(mydf) = labels
      return(mydf)
    }
Here's how you could use these functions to extract information into a data frame.
    # grab ad links from page 1
    ad_links = get_ad_links(page = 1)
Applying get_ad_details to each of these links and combining the results returns output like the following:
    Prix :     Ville :      Frais d'agence inclus :  Type de bien :  Pièces :  Surface :  Classe énergie :  GES :
    469 000 €  59000 Lille  Oui                      Maison          8         250 m2     F (de 331 à 450)  <NA>
    469 000 €  59000 Lille  Oui                      Maison          8         250 m2     F (de 331 à 450)  <NA>
    140 000 €  59000 Lille  <NA>                     Appartement     2         50 m2      D (de 151 à 230)  E (de 36 à 55)
    140 000 €  59000 Lille  <NA>                     Appartement     2         50 m2      D (de 151 à 230)  E (de 36 à 55)
    170 000 €  59000 Lille  <NA>                     Appartement     <NA>      50 m2      D (de 151 à 230)  D (de 21 à 35)
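Since not every ad lists the same fields, the one-row data frames returned by get_ad_details can have different columns, and a plain rbind would fail on them. Here is a minimal sketch of one way to combine them in base R, filling the gaps with NA. The helper bind_rows_fill and the sample values are my own illustration, not part of the code above, and the sample rows are made up so that it runs offline.

```r
# Hypothetical helper (an assumption, not part of the code above):
# row-bind data frames whose columns differ, filling missing columns with NA.
bind_rows_fill = function(dfs){
  # union of all column names, in order of first appearance
  all_cols = unique(unlist(lapply(dfs, names)))
  padded = lapply(dfs, function(df){
    # add every column this data frame lacks, filled with NA
    for (col in setdiff(all_cols, names(df))) df[[col]] = NA
    # put the columns in one common order
    df[, all_cols, drop = FALSE]
  })
  do.call(rbind, padded)
}

# Made-up stand-ins for what get_ad_details might return for two ads
ad1 = data.frame(Prix = "100 000", Type = "Maison", Pieces = "5",
                 stringsAsFactors = FALSE)
ad2 = data.frame(Prix = "80 000", Type = "Appartement",
                 stringsAsFactors = FALSE)

ad_details = bind_rows_fill(list(ad1, ad2))
print(ad_details)

# With the real functions this would be something like:
# ad_details = bind_rows_fill(lapply(ad_links, get_ad_details))
```

If you already use plyr, its rbind.fill function does the same job.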
You can easily use the apply family of functions to cycle through multiple pages and get detailed information about all ads. Two things to keep in mind. First, check whether scraping the website is legal and permitted by its terms of use. Second, call Sys.sleep in your looping function so that the server is not bombarded with requests.
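To make the throttling point concrete, here is a rough sketch of a paged loop that pauses before every request. The wrapper scrape_pages, its delay argument, and the stand-in fake_get_ad_links are hypothetical names I made up for illustration; the stand-in replaces get_ad_links so the example runs without touching the network.

```r
# Hypothetical wrapper (an assumption, not part of the code above):
# call `fetch` once per page, sleeping `delay` seconds before each call.
scrape_pages = function(pages, fetch, delay = 2){
  results = lapply(pages, function(p){
    Sys.sleep(delay)  # pause so the server is not hit in rapid succession
    fetch(p)
  })
  # flatten the per-page results into one vector
  unlist(results)
}

# Offline stand-in for get_ad_links: returns two fake links per page
fake_get_ad_links = function(page) paste0("ad_", page, "_", 1:2)

all_links = scrape_pages(1:3, fake_get_ad_links, delay = 0.1)
print(all_links)

# With the real function this would be something like:
# all_links = scrape_pages(1:5, get_ad_links, delay = 2)
```

A delay of a couple of seconds is only a guess on my part; pick a value that suits the site you are scraping.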
Let me know how it works.
Ramnath