How do I get the HTML of a fully loaded page (after its JavaScript has run) in Java?

I need to parse a page. Everything works, except that some elements on the page are loaded dynamically. I used jsoup for the static elements, and when I realized that I really needed the dynamic ones, I tried JavaFX. I read a lot of answers on Stack Overflow, and many of them recommended the JavaFX WebEngine. So I ended up with this code:

    @Override
    public void start(Stage primaryStage) {
        WebView webview = new WebView();
        final WebEngine webengine = webview.getEngine();
        webengine.getLoadWorker().stateProperty().addListener(
            new ChangeListener<State>() {
                public void changed(ObservableValue ov, State oldState, State newState) {
                    if (newState == Worker.State.SUCCEEDED) {
                        Document doc = webengine.getDocument();
                        // Serialize DOM
                        OutputFormat format = new OutputFormat(doc);
                        // as a String
                        StringWriter stringOut = new StringWriter();
                        XMLSerializer serial = new XMLSerializer(stringOut, format);
                        try {
                            serial.serialize(doc);
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                        // Display the XML
                        System.out.println(stringOut.toString());
                    }
                }
            });
        webengine.load("http://detail.tmall.com/item.htm?spm=a220o.1000855.0.0.PZSbaQ&id=19378327658");
        primaryStage.setScene(new Scene(webview, 800, 800));
        primaryStage.show();
    }

I built a string from the org.w3c.dom.Document and printed it, but it was useless. primaryStage.show() showed me the fully loaded page (including the element I need), but the HTML in the output did not contain that element.

This is the third day I have been working on this problem. Lack of experience is of course my main problem, but I have to admit that I'm stuck. This is my first Java project after reading through the Java help. I'm doing it to get some real experience (and for fun). I want to write a parser for the Chinese "ebay".

Here are my test cases:

http://detail.tmall.com/item.htm?spm=a220o.1000855.0.0.PZSbaQ&id=19378327658 : I need to get the dynamically loaded discount price "129.00"

http://item.taobao.com/item.htm?spm=a230r.1.14.67.MNq30d&id=22794120348 : the required value is "15.20"

If you open these pages in a browser, you first see the original price, and after a second or so the discounted price appears.

Is it possible to get these dynamic discounts from the HTML page? The other elements I need for the analysis are static. What should I try next: another library that renders HTML with JavaScript, or something else entirely? I really need advice, and I don't want to give up.

2 answers

The DOM returned once the worker reaches Worker.State.SUCCEEDED should already have been processed by JavaScript.
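As a side note, the same DOM can also be serialized with the standard javax.xml.transform API rather than the internal XMLSerializer class. Below is a minimal sketch of that idea, not a drop-in replacement: DomDump and toXml are hypothetical names, and the small document built in main only stands in for the result of webengine.getDocument().

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomDump {
    // Hypothetical helper: serialize any org.w3c.dom.Document to a String.
    public static String toXml(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Skip the <?xml ...?> header so only the markup itself is printed.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in document (assumption); in the listener you would pass
        // webengine.getDocument() instead.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element strong = doc.createElement("STRONG");
        strong.setAttribute("class", "J_CurPrice");
        strong.setTextContent("129.00");
        doc.appendChild(strong);
        System.out.println(toXml(doc));
    }
}
```

Inside the changed(...) callback this would be called as DomDump.toXml(webengine.getDocument()).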

Your code worked for me when I tested it with JavaFX in 7u40 and 8.0 dev. I see the following output in the log:

 <DIV id="J_PromoBox"><EM class="tb-promo-price-type">夏季新品</EM><EM class="tm-yen">¥</EM> <STRONG class="J_CurPrice">129.00</STRONG></DIV> 

which contains the dynamically loaded data field (129.00) that you were looking for.

You might want to upgrade your JDK to 7u40, or re-check how you are searching the logged output.
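If picking the value out of the printed log by eye is error-prone, one option is to query the DOM directly instead of the serialized string. The following is a minimal sketch, assuming the price element keeps the class name J_CurPrice seen in the output above. PriceFinder, findPrice, and parse are hypothetical names, and the abbreviated sample markup is parsed from a string only to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PriceFinder {
    // Hypothetical helper: returns the text of the first element whose class
    // attribute is exactly "J_CurPrice", or an empty string if none exists.
    public static String findPrice(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Match on the attribute only, so the tag-name case does not matter.
        return xpath.evaluate("//*[@class='J_CurPrice']", doc).trim();
    }

    // Parses a well-formed markup fragment into a DOM (for this demo only).
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Abbreviated version of the logged fragment (assumption: same structure).
        String sample = "<DIV id=\"J_PromoBox\"><EM class=\"tm-yen\">\u00a5</EM> "
                + "<STRONG class=\"J_CurPrice\">129.00</STRONG></DIV>";
        System.out.println(findPrice(parse(sample))); // prints 129.00
    }
}
```

In the listener from the question you would pass webengine.getDocument() to findPrice rather than parsing a string. An empty result would suggest the element had not been added to the DOM yet at the moment the listener ran.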


It looks like you want the rendered DOM of a dynamic page, that is, the DOM as it stands after the JavaScript on the page has finished modifying the original HTML. This is not easy to do in pure Java, since you would need to implement browser functionality with a built-in JavaScript engine. If all you care about is reading the web page from Java, you may want to look into Selenium, since it takes control of a real browser and lets you pull out the rendered HTML from Java.

This answer may also help:

Render JavaScript and HTML in any Java program (access the rendered DOM tree)?
