I have a start page, http://www.example.com/startpage, which lists 1220 results paginated in a standard way, e.g. 20 results per page.
I have working code that parses the first page of results and follows links whose URLs contain "example_guide/paris_shops". I then use Nokogiri to pull specific data out of each of those detail pages. Everything works well, and 20 results are written to a file.
However, I cannot figure out how to get Anemone to move on to the next page of results (http://www.example.com/startpage?page=2), parse that page as well, then continue to the third page (http://www.example.com/startpage?page=3), and so on.
So I would like to ask if anyone knows how I can get Anemone to start on a page, parse all the links on that page (and drill one level deeper for the specific data), but then also follow the pagination link to the next page of results so that Anemone can start parsing again, and so on for every page. Since the pagination links are different from the links in the results, Anemone of course does not follow them.
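To illustrate the behaviour I'm after, here is a rough sketch using Anemone's focus_crawl block to whitelist both kinds of links. The pagination regex is a guess on my part, and I'm not sure this is the intended use of the API:

Anemone.crawl("http://www.example.com/startpage", :delay => 3) do |anemone|
  # Guesswork: only follow shop detail links and pagination links,
  # ignoring everything else on the page.
  anemone.focus_crawl do |page|
    page.links.select do |link|
      link.to_s =~ /example_guide\/paris_shops/ ||
        link.to_s =~ /startpage\?page=\d+/
    end
  end
  # ... same on_pages_like block as in the full script below ...
end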
I am currently loading the URL of the first page of results, allowing me to finish and then paste into the next URL for the 2nd page of results, etc. Very manual and inefficient, especially for getting hundreds of pages.
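In effect, my current process is equivalent to the loop below, with the last page number hard-coded by hand (last_page is a placeholder, derived from 1220 results at 20 per page); this repetition is exactly what I want Anemone to handle for me:

last_page = 61  # placeholder: 1220 results / 20 per page
(1..last_page).each do |n|
  Anemone.crawl("http://www.example.com/startpage?page=#{n}", :delay => 3) do |anemone|
    # ... same on_pages_like block as in the full script below ...
  end
end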
Any help would be greatly appreciated.
require 'rubygems'
require 'anemone'
require 'nokogiri'
require 'open-uri'

Anemone.crawl("http://www.example.com/startpage", :delay => 3) do |anemone|
  # Only process shop detail pages, not the paginated results pages.
  anemone.on_pages_like(/example_guide\/paris_shops\/[^?]*$/) do |page|
    # Fetch the page and parse it with Nokogiri.
    doc = Nokogiri::HTML(open(page.url))

    # Pull out the fields, skipping any that are missing on the page.
    name    = doc.at_css("#top h2").text                  unless doc.at_css("#top h2").nil?
    address = doc.at_css(".info tr:nth-child(3) td").text unless doc.at_css(".info tr:nth-child(3) td").nil?
    website = doc.at_css("tr:nth-child(5) a").text        unless doc.at_css("tr:nth-child(5) a").nil?

    # Append one tab-separated line per shop.
    open('savedwebdata.txt', 'a') do |f|
      f.puts "#{name}\t#{address}\t#{website}\t#{Time.now}"
    end
  end
end
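(As an aside, I believe Anemone already parses each fetched page with Nokogiri and exposes it as page.doc, so the open-uri re-fetch above may be redundant; I've left it in because it works.)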