How to scrape pages with lazy loading

Here is the code I used to parse the webpage. I ran it in the Rails console, but I am not getting any output. The site I want to scrape uses lazy loading.

    require 'nokogiri'
    require 'open-uri'

    page = 1
    while true
      url = "http://www.justdial.com/functions" +
            "/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits" +
            "&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=#{page}"
      doc = Nokogiri::HTML(open(url))
      doc = Nokogiri::HTML(doc.at_css('#ajax').text)
      d = doc.css(".rslwrp")
      d.each do |t|
        puts t.css(".jrcw").text
        puts t.css("span.jcn").text
        puts t.css(".jaid").text
        puts t.css(".estd").text
        page += 1
      end
    end
1 answer

You have 2 options:

  • Switch from plain HTTP scraping to a tool that supports JavaScript evaluation, for example Capybara (with an appropriate driver selected). This can be slow, since you are running a headless browser under the hood, and you have to set timeouts or otherwise make sure the blocks of text you are interested in have loaded before you start scraping (see the sketch after this list).

  • The second option is to use the browser's developer console to find out how those blocks of text are loaded (which AJAX calls are made, with which parameters, and so on) and replicate those calls in your scraper. This is a more advanced approach, but more efficient, since you avoid the extra overhead of option 1.
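
For option 1, here is a minimal sketch of what that could look like with Capybara and a headless Chrome driver. The driver choice, the page URL, and the 10-second wait are my assumptions; only the .rslwrp / span.jcn selectors come from your snippet:

    require 'capybara'
    require 'capybara/dsl'

    # Assumes the selenium-webdriver gem and a headless Chrome are installed.
    Capybara.default_driver = :selenium_chrome_headless
    Capybara.run_server = false

    include Capybara::DSL

    visit 'http://www.justdial.com/Delhi-NCR/Pandits' # hypothetical listing URL

    # Wait up to 10 seconds for the lazy-loaded result blocks to appear.
    if page.has_css?('.rslwrp', wait: 10)
      page.all('.rslwrp').each do |t|
        puts t.all('span.jcn').map(&:text).join(' ')
      end
    end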

Have a nice day!

UPDATE:

Your code above does not work because the response is HTML markup wrapped in a JSON object, while you are trying to parse it as raw HTML. It looks like this:

 { "error": 0, "msg": "request successful", "paidDocIds": "some ids here", "itemStartIndex": 20, "lastPageNum": 50, "markup": 'LOTS AND LOTS AND LOTS OF MARKUP' } 

You need to unwrap the JSON first and then parse the markup field as HTML:

    require 'json'

    json = JSON.parse(open(url).read) # make sure you check HTTP errors here
    html = json['markup']             # can this field be empty? check the json['error'] field
    doc  = Nokogiri::HTML(html)       # parse as you like
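
Putting that together with the pagination loop from your question, a sketch could look like this. Stopping via lastPageNum and bailing out on the error field are assumptions based on the sample response above:

    require 'json'
    require 'open-uri'
    require 'nokogiri'

    base = "http://www.justdial.com/functions/ajxsearch.php?national_search=0" \
           "&act=pagination&city=Delhi+%2F+NCR&search=Pandits" \
           "&where=Delhi+Cantt&catid=1195&psearch=&prid="

    page = 1
    loop do
      json = JSON.parse(open("#{base}&page=#{page}").read)

      # Bail out on an API-level error or empty markup (fields from the sample response).
      break if json['error'] != 0 || json['markup'].to_s.empty?

      doc = Nokogiri::HTML(json['markup'])
      doc.css('.rslwrp').each { |t| puts t.css('span.jcn').text }

      # Stop once the last page reported by the response has been processed.
      break if page >= json['lastPageNum'].to_i
      page += 1
    end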

I would also advise against using open-uri, as your code may become vulnerable if you use dynamic URLs, because of the way open-uri works (read the related article for details). Use better, more featureful libraries such as HTTParty or RestClient instead.
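
As a sketch, the same fetch with HTTParty could look like this (assuming url is built as above; the status check is my addition, not part of the original answer):

    require 'httparty'
    require 'json'
    require 'nokogiri'

    response = HTTParty.get(url)
    raise "HTTP error: #{response.code}" unless response.code == 200

    json = JSON.parse(response.body)
    doc  = Nokogiri::HTML(json['markup'])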

UPDATE 2: A minimal script that works for me:

    require 'json'
    require 'open-uri'
    require 'nokogiri'

    url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=2'

    json = JSON.parse(open(url).read) # make sure you check HTTP errors here
    html = json['markup']             # can this field be empty? check the json['error'] field
    doc  = Nokogiri::HTML(html)       # parse as you like
    puts doc.at_css('#newphoto10').attr('title')
    # => Dr Raaj Batra Lal Kitab Expert in East Patel Nagar, Delhi