I have a start page, http://www.example.com/startpage, which lists 1220 results paginated in a standard way, e.g. 20 results per page.
I have working code that parses the first page of results and follows links whose URLs contain "example_guide/paris_shops". I then use Nokogiri to pull specific data out of each of those detail pages. Everything works well, and 20 results are written to a file.
However, I cannot figure out how to get Anemone to move on to the next page of results (http://www.example.com/startpage?page=2), parse that page as well, then continue to the third page (http://www.example.com/startpage?page=3), and so on.
So I would like to ask if anyone knows how I can get Anemone to start on a page, parse all the links on that page (and drill one level deeper for the specific data), but then also follow the pagination link to the next page of results so that Anemone can start parsing again, and so on for every page. Since the pagination links are different from the links in the results, Anemone of course does not follow them.
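To illustrate the behaviour I'm after, here is a rough sketch using Anemone's focus_crawl block to whitelist both kinds of links. The pagination regex is a guess on my part, and I'm not sure this is the intended use of the API:

Anemone.crawl("http://www.example.com/startpage", :delay => 3) do |anemone|
  # Guesswork: only follow shop detail links and pagination links,
  # ignoring everything else on the page.
  anemone.focus_crawl do |page|
    page.links.select do |link|
      link.to_s =~ /example_guide\/paris_shops/ ||
        link.to_s =~ /startpage\?page=\d+/
    end
  end
  # ... same on_pages_like block as in the full script below ...
end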
I am currently loading the URL of the first page of results, allowing me to finish and then paste into the next URL for the 2nd page of results, etc. Very manual and inefficient, especially for getting hundreds of pages.
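In effect, my current process is equivalent to the loop below, with the last page number hard-coded by hand (last_page is a placeholder, derived from 1220 results at 20 per page); this repetition is exactly what I want Anemone to handle for me:

last_page = 61  # placeholder: 1220 results / 20 per page
(1..last_page).each do |n|
  Anemone.crawl("http://www.example.com/startpage?page=#{n}", :delay => 3) do |anemone|
    # ... same on_pages_like block as in the full script below ...
  end
end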
Any help would be greatly appreciated.
require 'rubygems'
require 'anemone'
require 'nokogiri'
require 'open-uri'

Anemone.crawl("http://www.example.com/startpage", :delay => 3) do |anemone|
  # Only process shop detail pages, not the paginated results pages.
  anemone.on_pages_like(/example_guide\/paris_shops\/[^?]*$/) do |page|
    # Fetch the page and parse it with Nokogiri.
    doc = Nokogiri::HTML(open(page.url))

    # Pull out the fields, skipping any that are missing on the page.
    name    = doc.at_css("#top h2").text                  unless doc.at_css("#top h2").nil?
    address = doc.at_css(".info tr:nth-child(3) td").text unless doc.at_css(".info tr:nth-child(3) td").nil?
    website = doc.at_css("tr:nth-child(5) a").text        unless doc.at_css("tr:nth-child(5) a").nil?

    # Append one tab-separated line per shop.
    open('savedwebdata.txt', 'a') do |f|
      f.puts "#{name}\t#{address}\t#{website}\t#{Time.now}"
    end
  end
end
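(As an aside, I believe Anemone already parses each fetched page with Nokogiri and exposes it as page.doc, so the open-uri re-fetch above may be redundant; I've left it in because it works.)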