Loading an entire HTML page?

I am trying to write a screen scraper, and I want to load the main page of a website.

I do not get all the HTML that I see when I view the page source in a browser. How can I make sure that I download everything I see when I view the source code in a browser?

    # Required gems
    require 'rubygems'  # Loads gems
    require "nokogiri"  # Nokogiri
    require "open-uri"  # For Nokogiri
    require "chronic"   # For time parsing
    require "cgi"       # For parsing URLs
    require 'net/http'  # For image downloading

    URL = URI.parse("http://www.gocrimson.com/landing/index")
    # URI.open comes from open-uri and fetches the page over HTTP
    hBOList = Nokogiri::HTML(URI.open(URL))
4 answers

The source displayed in the browser does not necessarily match what you get when you request the HTML file itself, because Ajax may be used to load fragments of the page after the initial page load.

You cannot use conventional methods to extract the source of a page that relies on JavaScript and Ajax, unless you decode the entire chain of content requests and re-create it in your Ruby code.

Alternatively, you can drive a real browser from Ruby: tell it to load the start page, let the page's JavaScript run and load the additional content, and then have your code extract the resulting HTML and do what you want with it. To do this, look at Watir or one of its derivatives.
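For the first option, re-creating the requests by hand, a minimal sketch might look like the following. It assumes you have already found the fragment paths the page requests (for example in the browser's network inspector); the helper names `http_get_body` and `fetch_page_with_fragments` are hypothetical, and any fragment path you pass in is your own discovery, not something this code can work out.

```ruby
require "net/http"
require "uri"

# Fetch one URL and return the response body as a String.
def http_get_body(uri)
  Net::HTTP.get(uri)
end

# Fetch the initial page, then each fragment the page's JavaScript would
# normally request. `fragment_paths` is an assumption: you must find the
# real paths yourself. `fetcher` is injectable so the logic can be
# exercised without touching the network.
def fetch_page_with_fragments(page_url, fragment_paths, fetcher: method(:http_get_body))
  page_uri = URI.parse(page_url)
  bodies = { page_uri.to_s => fetcher.call(page_uri) }

  fragment_paths.each do |path|
    fragment_uri = URI.join(page_url, path) # resolve relative to the page URL
    bodies[fragment_uri.to_s] = fetcher.call(fragment_uri)
  end

  bodies # Hash of URL string => response body
end
```

Calling `fetch_page_with_fragments("http://www.gocrimson.com/landing/index", ["/hypothetical/fragment"])` would return the landing page plus that one (made-up) fragment. This only covers plain GET requests; pages that build their requests dynamically may need more work, which is where a real browser becomes the easier route.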


Does:

    require 'open-uri'

    # Save the raw response body to a file for inspection
    File.open("page_test.txt", "w") do |f|
      f << URI.open("http://www.gocrimson.com/landing/index").read
    end

copy all the content of the desired page? If so, Nokogiri is dropping something somewhere and/or the site is loading something via JavaScript after the page renders. If not, your parsing code would be interesting to see.
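One way to make that check less manual is to look for specific strings you can see in the browser and test whether they appear in the raw HTML. The sketch below uses inline sample data rather than the real page, and `missing_markers` is a hypothetical helper name.

```ruby
# Return the markers (strings visible in the browser) that are absent from
# the raw HTML. Anything returned here was probably added after the
# initial response, for example by JavaScript.
def missing_markers(raw_html, markers)
  markers.reject { |marker| raw_html.include?(marker) }
end

# Inline sample data (hypothetical, not the real page): the "scores" div is
# empty in the raw HTML, as it would be if a script filled it in later.
sample_html = '<html><body><div id="scores"></div><p>Welcome</p></body></html>'
missing = missing_markers(sample_html, ["Welcome", "Final score"])
puts "Not in raw HTML: #{missing.inspect}"
```

To run it against the saved file, pass `File.read("page_test.txt")` as the first argument along with a few strings copied from the rendered page.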


I think some content is loaded through an Ajax call when a button is clicked or after some other action. If you know what content you want and which action triggers it, you can look at Mechanize. Mechanize uses Nokogiri internally and helps with loading pages that require some action.


Hisako and the red doll, you should try Watir, as the Tin Man suggested above. Something like:

    require 'rubygems'
    require 'watir-webdriver'

    browser = Watir::Browser.new
    browser.goto "http://www.gocrimson.com/landing/index"
    puts browser.html

That should do what you want.
