Impossible site for HtmlUnit?

I cannot, for the life of me, get HtmlUnit to fetch this site:

http://www.bing.com/travel/flight/flightSearch?form=FORMTRVLGENERIC&q=flights+from+SLC+to+BKK+leave+07%2F30%2F2010+return+08%2F11%2F2010+adults%3A1+class%3ACOACH&stoc=0&vo1=Salt+Lake+City%2C+UT+%28SLC%29+-+Salt+Lake+City+International+Airport&o=SLC&ve1=Bangkok%2C+Thailand+%28BKK%29+-+Suvarnabhumi+International&e=BKK&d1=07%2F30%2F2010&r1=08%2F11%2F2010&p=1&b=COACH&baf=true

I am sure that this is due to the huge number of scripts running in the background. Perhaps these scripts have not been given enough time to fully load?

I also tried just grabbing bing.com/travel and was unsuccessful. It fails inside the client.getPage call that should return the HtmlPage.

The output shows a lot of runtimeErrors ("the data necessary to complete this operation is not yet available"), all for the same sourceName ("http://www.bing.com/travel/jsxc.vjs?a=common&v=5.5.0-1278007084280").

Then a few exceptions are thrown for a missing "(" in a couple of scripts on bing.com.

Then it calls some JavaScript and abruptly ends.

I understand that this may be a combination of several problems that others do not run into. So, if there are no suggestions, could someone try loading these two sites through their own HtmlUnit test setup and see whether they can get basic XML or text output? I'm not trying to do anything fancy here, just get the basic text or XML output.

It would be useful to know if any other implementation is working so that I can complete my jury project.

CODE:

    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.WebClient;

    public class test {
        public static void main(String[] args) throws Exception {
            WebClient client = new WebClient();
            System.out.println("webclient loaded");
            HtmlPage currentPage = client.getPage("http://www.bing.com/travel/flight/flightSearch?form=FORMTRVLGENERIC&q=flights+from+SLC+to+BKK+leave+07%2F30%2F2010+return+08%2F11%2F2010+adults%3A1+class%3ACOACH&stoc=0&vo1=Salt+Lake+City%2C+UT+%28SLC%29+-+Salt+Lake+City+International+Airport&o=SLC&ve1=Bangkok%2C+Thailand+%28BKK%29+-+Suvarnabhumi+International&e=BKK&d1=07%2F30%2F2010&r1=08%2F11%2F2010&p=1&b=COACH&baf=true");
            client.waitForBackgroundJavaScript(10000);
            System.out.println("htmlpage init'd");
            //System.out.println(currentPage.getTitleText());
            String textSource = currentPage.asXml();
            System.out.println(textSource);
        }
    }

Thanks!

java javascript ajax screen-scraping htmlunit
3 answers

Try adding this:

    client.setThrowExceptionOnScriptError(false);

It takes a long time to run, and boy does it produce a lot of logging ... but in the end the page came out:

 htmlpage init'd <?xml version="1.0" encoding="utf-8"?> <html id=""> <head> ... 
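
For reference, a minimal sketch of the question's program with that one line added might look like the following. It uses only the calls already shown in this thread (the `com.gargoylesoftware.htmlunit` API of that era). The class name `LenientFetch`, the helper `newLenientClient`, and the shortened `bing.com/travel` URL are placeholders of mine. As far as I know, newer HtmlUnit releases moved this setting to `client.getOptions()`, so adjust for your version.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Hypothetical class name; only the WebClient calls come from this thread.
final class LenientFetch {

    // Builds a client that does not abort the page load when a script fails.
    static WebClient newLenientClient() {
        WebClient client = new WebClient();
        client.setThrowExceptionOnScriptError(false);
        return client;
    }

    public static void main(String[] args) throws Exception {
        WebClient client = newLenientClient();
        // Placeholder URL; substitute the full flight-search URL from the question.
        HtmlPage page = client.getPage("http://www.bing.com/travel");
        // Give background scripts up to 10 seconds to finish, as in the question.
        client.waitForBackgroundJavaScript(10000);
        System.out.println(page.asXml());
    }
}
```

Expect a lot of log output before the XML appears, as noted above.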

I also had a problem with "the data needed to complete this operation is not yet available."
Switching the user agent to "Firefox" helped ...
http://steveliles.github.com/jquery_htmlunit_runtimeerror_messages_galore.html
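
In HtmlUnit the emulated browser is chosen when the `WebClient` is constructed, so the switch could look roughly like this. The constant `BrowserVersion.FIREFOX_3` is an assumption on my part: the available `BrowserVersion` constants differ between releases, so check which ones your version offers. The class and method names are placeholders.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;

// Hypothetical class name, for illustration only.
final class FirefoxClient {

    // Assumption: FIREFOX_3 exists in your HtmlUnit release; pick whichever
    // Firefox constant your version of BrowserVersion actually provides.
    static WebClient newFirefoxClient() {
        return new WebClient(BrowserVersion.FIREFOX_3);
    }

    public static void main(String[] args) {
        WebClient client = newFirefoxClient();
        // Print the user agent string the client will present to the site.
        System.out.println(client.getBrowserVersion().getUserAgent());
    }
}
```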


Browsers are highly tolerant of things they could flag as errors (in JavaScript, but also in HTML, CSS, etc.). This is partly due to the various conflicting "standards" :) for how JavaScript was implemented. Something that works fine in one browser causes problems in another. So when all these messages become visible, it can be a little disconcerting.

To put this in perspective: in Internet Explorer, open your settings and, under "Advanced", check "Display a notification about every script error", then browse the same sites. You may be surprised at how much code IE gets through by simply ignoring what it could flag as problems.

Emulating different browsers with HtmlUnit simply uncovers some of these conflicts.

Telling HtmlUnit to do something like "Ignore ... for this browser" is perfectly good practice. In my case, I am scraping data from a site that checks that all users are using Internet Explorer (no, I have no good idea why they do this), so I can't proceed without ignoring JavaScript errors. Interestingly, the site works fine even though IE thinks there are a lot of JavaScript errors.
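
The difference between the two behaviours can be shown with a toy model in plain Java. This is not how HtmlUnit or any browser is implemented; `ScriptRunner`, `runAll`, and the fake "scripts" are all invented for illustration. In strict mode the first failing script aborts everything; in lenient mode the failure is recorded and the remaining scripts still run.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: each Runnable stands in for one script on a page.
final class ScriptRunner {

    // Runs the scripts in order and returns the messages of those that failed.
    // strict = true:  rethrow on the first failure (a client that throws on script errors).
    // strict = false: remember the failure and carry on (a tolerant browser).
    static List<String> runAll(List<Runnable> scripts, boolean strict) {
        List<String> errors = new ArrayList<>();
        for (Runnable script : scripts) {
            try {
                script.run();
            } catch (RuntimeException e) {
                if (strict) {
                    throw e;
                }
                errors.add(e.getMessage());
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        List<String> rendered = new ArrayList<>();
        List<Runnable> scripts = List.of(
            () -> rendered.add("header"),
            () -> { throw new IllegalStateException("missing ( in script"); },
            () -> rendered.add("results"));
        // Lenient run: the broken script is noted, the other two still execute.
        List<String> errors = runAll(scripts, false);
        System.out.println("rendered=" + rendered + " errors=" + errors);
    }
}
```

In the lenient run the page content produced by the healthy scripts is still there, which is why a site can "work fine" while the error list keeps growing.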
