Get the content (loaded via an AJAX call) of a web page

I'm new to crawling. I need to fetch the posts and comments from a link, and I want to automate this process. I considered using a web crawler and jsoup for this, but I was told that web crawlers are mainly used for sites with more depth.

Sample Page: Jive Community Website

For this page, when I view the page source, I see only the post, not the comments. I think this is because the comments are fetched through an AJAX call to the server.

As a result, when I use jsoup, it does not retrieve the comments.

So, how can I automate the process of collecting posts and comments?

+6
2 answers

Jsoup is only an HTML parser. Unfortunately, it cannot parse content generated by JavaScript / AJAX, because jsoup does not execute scripts.

Solution: use a library that can handle scripts.

Here are some examples that I know:

If such a library does not support parsing or selectors, you can at least use it to obtain the HTML produced by the scripts (which jsoup can then parse).

+9

Jsoup does not handle JavaScript or AJAX, so you need to use HtmlUnit or Selenium. After loading the page with HtmlUnit (or a similar tool), you can use jsoup for the rest of the task.
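
The two-step approach described above might look roughly like the sketch below. It assumes HtmlUnit 3.x (package org.htmlunit; 2.x releases use com.gargoylesoftware.htmlunit instead) and jsoup on the classpath. The URL, the 10-second wait, and the div.comment-body selector are hypothetical placeholders and would need to be adapted to the markup of the actual page.

```java
import org.htmlunit.BrowserVersion;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.List;

public class RenderedPageScraper {

    // Step 1: load the page in HtmlUnit's headless browser, give its
    // background scripts (including AJAX calls) time to run, and return
    // the resulting DOM serialized as markup.
    static String fetchRenderedHtml(String url, long waitMillis) throws IOException {
        try (WebClient client = new WebClient(BrowserVersion.CHROME)) {
            client.getOptions().setJavaScriptEnabled(true);
            client.getOptions().setCssEnabled(false);
            // Real-world pages often contain script errors; do not abort on them.
            client.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = client.getPage(url);
            // Waits until background JavaScript finishes or the timeout elapses.
            client.waitForBackgroundJavaScript(waitMillis);
            return page.asXml();
        }
    }

    // Step 2: hand the rendered markup to jsoup and select elements as usual.
    static List<String> extractText(String html, String baseUri, String cssSelector) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.select(cssSelector).eachText();
    }

    public static void main(String[] args) throws IOException {
        String url = "https://example.com/thread/123"; // hypothetical URL
        String html = fetchRenderedHtml(url, 10_000);   // assumed wait time
        // "div.comment-body" is an assumed selector, not taken from the real site.
        for (String comment : extractText(html, url, "div.comment-body")) {
            System.out.println(comment);
        }
    }
}
```

Keeping the fetch and the parse in separate methods means the jsoup part can be tried against a saved HTML string without touching the network. If the comments load only after a click or a scroll, a fixed wait may not be enough and the page would have to be driven explicitly.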

+2
