Get innerHTML via Jsoup

Question


I am trying to scrape data from this site: http://www.bundesliga.de/de/liga/tabelle/

In the source code, I see tables, but there is no content there, just things like:

<td>[no content]</td> <td>[no content]</td> <td>[no content]</td> <td>[no content]</td> ....

With Firebug (F12 in Firefox) I do not see any content in the page source either, but I can select a table and then copy its innerHTML via the Firebug option. That way I get all the information about the teams, but I do not know how to get the table with its contents in Jsoup.

+6

html web-scraping jsoup

asked Feb 22 '14 at 15:05


2 answers

To get the value of an attribute, use the Node.attr(String key) method. For the text of an element (and its children), use Element.text(). For HTML, use Element.html() or Node.outerHtml(), depending on the situation. For example:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(html);
    Element link = doc.select("a").first();

    String text = doc.body().text();       // "An example link."
    String linkHref = link.attr("href");   // "http://example.com/"
    String linkText = link.text();         // "example"
    String linkOuterH = link.outerHtml();  // "<a href="http://example.com/"><b>example</b></a>"
    String linkInnerH = link.html();       // "<b>example</b>"

link: http://jsoup.org/cookbook/extracting-data/attributes-text-html

+4

Adel Feb 23 '14 at 10:56


luksch · Accepted Answer · 2014-02-23T10:53:52+0000

The table is not rendered on the server. It is built on the client side by the page's JavaScript, from data that is sent to the client via AJAX. As such, the empty cells you receive are exactly what is expected with a naive Jsoup approach.

I see two possible solutions:

1. You analyze the network traffic and identify the AJAX calls that the site makes. Then you work out the request format and issue the same requests that the JavaScript does. From the responses you can rebuild the table.
2. You do not use Jsoup alone, but a real browser that loads the page and runs the JavaScript, including all AJAX calls. You can use Selenium WebDriver for this. There is a headless browser called PhantomJS with a relatively small footprint that you can use in conjunction with Selenium WebDriver.

Both options have their advantages and disadvantages:

1. The first option takes longer, since you need to understand the network traffic pretty well. The reward is a very fast and resource-efficient scraper.
2. Programming Selenium is very easy, and you should not have any difficulty achieving your goal. You do not need to understand the inner workings of the site you want to scrape. However, the price is an additional dependency in your project. Memory consumption is high, another process is spawned, and the scraper will be slow.
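
As a rough illustration of the first option, here is a minimal sketch that requests an AJAX endpoint directly and rebuilds the table rows from the response. Everything specific in it is an assumption for illustration only: the endpoint URL, the JSON payload shape (rank, team, points) and the class and method names are hypothetical, not taken from the Bundesliga site. It uses the JDK's own HTTP client (Java 11+) and a regular expression to stay dependency-free; in a real project a proper JSON library would be the more robust choice.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AjaxTableSketch {

    // Assumed (hypothetical) payload shape:
    // [{"rank":1,"team":"Team A","points":50}, ...]
    private static final Pattern ROW = Pattern.compile(
        "\\{\\s*\"rank\"\\s*:\\s*(\\d+)\\s*,\\s*\"team\"\\s*:\\s*\"([^\"]*)\"\\s*,"
        + "\\s*\"points\"\\s*:\\s*(\\d+)\\s*\\}");

    /** Fetches the raw response body of the AJAX endpoint found in the network traffic. */
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                // Some endpoints only answer requests that look like AJAX calls.
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Extracts {rank, team, points} rows from the assumed payload format. */
    static List<String[]> parseRows(String json) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(json);
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Sample payload standing in for a real response.
        String sample = "[{\"rank\":1,\"team\":\"Team A\",\"points\":50},"
                      + "{\"rank\":2,\"team\":\"Team B\",\"points\":47}]";
        // With a real endpoint (hypothetical URL):
        // String json = fetch("https://example.com/ajax/table");
        for (String[] row : parseRows(sample)) {
            System.out.println(row[0] + ". " + row[1] + " (" + row[2] + " points)");
        }
    }
}
```

If the endpoint returns an HTML fragment rather than JSON, the response body could be passed to Jsoup.parse(...) instead and queried with selectors as usual.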

Maybe you can find another source with a football table that contains the information you want? That might be the easiest option. For example: http://www.fussballdaten.de/bundesliga/
