Web crawler in Groovy (Jsoup vs Crawler4j)

I want to create a web crawler in Groovy (using the Grails framework and a MongoDB database) that can crawl a site and build a list of the site's URLs together with each resource's type, its content, the response time, and the number of redirects involved.

I am weighing Jsoup against Crawler4j. I have read roughly what each one does, but I cannot clearly see the difference between them. Can anyone suggest which one is better suited to the features above? Or is it simply wrong to compare these two?

Thanks.

web-crawler jsoup crawler4j
1 answer

Crawler4j is a crawler; Jsoup is a parser. In fact, you could (and probably should) use both. Crawler4j gives you a simple multi-threaded interface for fetching all the URLs and all the pages (content) of the site you want. After that, you can use Jsoup to parse the data with its excellent (jQuery-like) CSS selectors and actually do something with it. Of course, you also have to consider dynamic (JavaScript-generated) content. If you want that content too, you need something else with a JavaScript engine (a headless browser plus parser), such as HtmlUnit or WebDriver (Selenium), which will execute the JavaScript before you parse the content.
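For concreteness, here is a minimal sketch of that combination in Groovy. This is an illustration under assumptions, not code from either project's docs: it assumes crawler4j 4.x (where shouldVisit takes two arguments; older 3.x versions take only the WebURL) and pulls dependencies with @Grab. The seed URL https://example.com/, the storage folder /tmp/crawl, and the MongoDB comment are placeholders, and the question's response-time and redirect-count metrics are left to you (for example, timestamp around the fetch, or inspect the HTTP status codes).

    @Grab('edu.uci.ics:crawler4j:4.4.0')
    @Grab('org.jsoup:jsoup:1.15.3')
    import edu.uci.ics.crawler4j.crawler.*
    import edu.uci.ics.crawler4j.fetcher.PageFetcher
    import edu.uci.ics.crawler4j.parser.HtmlParseData
    import edu.uci.ics.crawler4j.robotstxt.*
    import edu.uci.ics.crawler4j.url.WebURL
    import org.jsoup.Jsoup

    // Crawler4j does the fetching; Jsoup does the parsing.
    class SiteCrawler extends WebCrawler {

        // Only follow links that stay on the target site (placeholder domain).
        @Override
        boolean shouldVisit(Page referringPage, WebURL url) {
            url.getURL().toLowerCase().startsWith('https://example.com/')
        }

        @Override
        void visit(Page page) {
            String url = page.getWebURL().getURL()
            String contentType = page.getContentType()   // the "resource type"
            int status = page.getStatusCode()

            if (page.getParseData() instanceof HtmlParseData) {
                String html = ((HtmlParseData) page.getParseData()).getHtml()
                // Hand the raw HTML to Jsoup for CSS-selector extraction.
                def doc = Jsoup.parse(html, url)
                println "$url [$contentType, HTTP $status] " +
                        "title='${doc.title()}', links=${doc.select('a[href]').size()}"
                // ...this is where you would persist the record to MongoDB.
            }
        }
    }

    // Wire everything up and start the crawl.
    def config = new CrawlConfig(crawlStorageFolder: '/tmp/crawl')
    def fetcher = new PageFetcher(config)
    def robots = new RobotstxtServer(new RobotstxtConfig(), fetcher)
    def controller = new CrawlController(config, fetcher, robots)
    controller.addSeed('https://example.com/')
    controller.start(SiteCrawler, 4)   // 4 crawler threads

For the JavaScript case mentioned above, a similarly hedged sketch with HtmlUnit (again with a placeholder URL): render the page first so its scripts run, then parse the resulting DOM with Jsoup.

    @Grab('net.sourceforge.htmlunit:htmlunit:2.70.0')
    @Grab('org.jsoup:jsoup:1.15.3')
    import com.gargoylesoftware.htmlunit.WebClient
    import org.jsoup.Jsoup

    def client = new WebClient()
    client.options.javaScriptEnabled = true
    client.options.throwExceptionOnScriptError = false
    try {
        def page = client.getPage('https://example.com/')
        // asXml() returns the DOM *after* scripts have run; parse that with Jsoup.
        def doc = Jsoup.parse(page.asXml(), page.url.toString())
        println doc.title()
    } finally {
        client.close()
    }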

