Java Web Scraper

I am not able to find any good Java-based web scraping API. The site I need to scrape does not provide an API either; I want to iterate over all its web pages using some pageID and extract the HTML titles and other content from their DOM trees.

Are there any ways other than web scraping?

+70
java frameworks web-scraping
Jul 08 '10
11 answers

jsoup

Extracting the title is not difficult, and you have many options; search Stack Overflow for "Java HTML parsers". One of them is Jsoup.

You can navigate the page using the DOM if you know the page structure; see http://jsoup.org/cookbook/extracting-data/dom-navigation
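
For illustration, here is a minimal Jsoup sketch; the URL and the h1 selector are placeholders, not something from the question:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class TitleScraper {
        public static void main(String[] args) throws Exception {
            // Fetch and parse the page (placeholder URL)
            Document doc = Jsoup.connect("https://example.com/page?pageID=1").get();

            // The <title> of the page
            System.out.println(doc.title());

            // DOM navigation via CSS selectors, e.g. all <h1> headers
            for (Element header : doc.select("h1")) {
                System.out.println(header.text());
            }
        }
    }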

It is a good library, and I have used it in my recent projects.

+90
Jul 08 '10 at 9:44

It is best to use Selenium WebDriver, since it:

  • Provides visual feedback to the coder (you can see your scraping in action and see where it stops).
  • Is accurate and consistent, as it directly controls the actual browser you use.
  • Is slow. It doesn't hit web pages the way HtmlUnit does, but sometimes you don't want to hit them too fast.

    HtmlUnit is fast, but is terrible at handling JavaScript and AJAX.
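
For reference, a minimal WebDriver sketch; the URL is a placeholder, and a ChromeDriver binary must be installed separately:

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;

    public class SeleniumScrape {
        public static void main(String[] args) {
            WebDriver driver = new ChromeDriver(); // opens a real browser window
            try {
                driver.get("https://example.com/page?pageID=1"); // placeholder URL
                System.out.println(driver.getTitle());
                // Extract headers with a CSS selector
                driver.findElements(By.cssSelector("h1"))
                      .forEach(el -> System.out.println(el.getText()));
            } finally {
                driver.quit(); // always close the browser
            }
        }
    }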

+20
Sep 23 '10 at 19:45

HtmlUnit can be used for web scraping; it supports calling pages, filling out forms, and submitting them. I have used it in my project. It is a good Java library for web scraping. Read more here.
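
A minimal HtmlUnit sketch (placeholder URL); JavaScript is switched off here since, as another answer notes, HtmlUnit's JavaScript handling is weak:

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class HtmlUnitScrape {
        public static void main(String[] args) throws Exception {
            try (WebClient client = new WebClient()) {
                // JavaScript disabled for speed and robustness
                client.getOptions().setJavaScriptEnabled(false);
                HtmlPage page = client.getPage("https://example.com/page?pageID=1"); // placeholder
                System.out.println(page.getTitleText());
            }
        }
    }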

+13
Jul 21 '11 at 12:22

Mechanize for Java would be a good fit for this, and as Wadjy Essam mentioned, it uses JSoup for the HTML. Mechanize is a stateful HTTP/HTML client that supports navigation, form submission, and page scraping.

http://gistlabs.com/software/mechanize-for-java/ (GitHub: https://github.com/GistLabs/mechanize )

+4
Sep 17

There is also Jaunt (Java web scraping and JSON querying): http://jaunt-api.com

+4
Sep 19 '17

Take a look at an HTML parser such as TagSoup, HTMLCleaner, or NekoHTML.

+2
Jul 08 '10 at 9:45

You can try the ui4j or cdp4j libraries for web scraping. ui4j requires Java 8 and uses the JavaFX WebKit browser, while cdp4j requires Chrome.

+2
Nov 11 '14 at 15:40

You can take a peek at jwht-scrapper !

It is a complete scraping framework with all the features a developer could expect from a web scraper.

It works with the jwht-htmltopojo lib ( https://github.com/whimtrip/jwht-htmltopojo ), which uses the Jsoup mentioned by several other people here.

Together they will help you build awesome scrapers that map HTML directly to POJOs, bypassing the classic scraping problems, in only a few minutes!
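
To make the HTML-to-POJO idea concrete without quoting jwht-htmltopojo's actual annotation API from memory, here is the equivalent mapping done by hand with plain Jsoup; the Page class and selectors are invented for illustration:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class HtmlToPojoSketch {

        // Hand-rolled POJO; a framework like jwht-htmltopojo fills such
        // objects automatically from per-field selector annotations.
        static class Page {
            String title;
            String headers;
        }

        public static void main(String[] args) throws Exception {
            Document doc = Jsoup.connect("https://example.com").get(); // placeholder URL
            Page page = new Page();
            page.title = doc.title();
            page.headers = doc.select("h1").text(); // combined text of all <h1> elements
            System.out.println(page.title + " / " + page.headers);
        }
    }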

Hope this helps some people!

Disclaimer: I am the one who developed it; feel free to send me your comments!

+2
Aug 10 '18 at 15:39

Using a web scraper, you can extract useful content from a web page and convert it into any format you need.

    WebScrap ws = new WebScrap();
    // set the website url to scrape
    ws.setUrl("http://dasnicdev.imtqy.com/webscrap4j/");
    // start the scraping session
    ws.startWebScrap();

Now your web scraping session has started, and you are ready to scrape or retrieve data in Java using the webscrap4j library.

For the title:

 System.out.println("-------------------Title-----------------------------"); System.out.println(ws.getSingleHTMLTagData("title")); 

For Tagline:

 System.out.println("-------------------Tagline-----------------------------"); System.out.println(ws.getSingleHTMLScriptData("<h2 id='project_tagline'>", "</h2>")); 

For all anchor tags:

 System.out.println("-------------------All anchor tag-----------------------------"); al=ws.getImageTagData("a", "href"); for(String adata: al) { System.out.println(adata); } 

For image data:

 System.out.println("-------------------Image data-----------------------------"); System.out.println(ws.getImageTagData("img", "src")); System.out.println(ws.getImageTagData("img", "alt")); 

For Ul-Li data:

 System.out.println("-------------------Ul-Li Data-----------------------------"); al=ws.getSingleHTMLScriptData("<ul>", "</ul>","<li>","</li>"); for(String str:al) { System.out.println(str); } 

For the complete source code, check out this tutorial.

+1
Jun 02 '15 at 8:37

If you want to automate the scraping of a large number of pages or data points, you can try Gotz ETL.

It is fully model-driven, like a true ETL tool. The data structures, task workflow, and pages to scrape are defined in a set of XML definition files, and no coding is required. Queries can be written either as selectors with JSoup or as XPath with HtmlUnit.
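
Gotz ETL's own XML definition syntax isn't reproduced here, but as an illustration of the XPath-with-HtmlUnit query style the tool builds on (placeholder URL):

    import java.util.List;

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlElement;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class XPathQuery {
        public static void main(String[] args) throws Exception {
            try (WebClient client = new WebClient()) {
                HtmlPage page = client.getPage("https://example.com"); // placeholder
                // XPath query, the second query style mentioned above
                List<HtmlElement> headers = page.getByXPath("//h1");
                for (HtmlElement h : headers) {
                    System.out.println(h.getTextContent());
                }
            }
        }
    }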

+1
Jan 23 '18 at 16:46

There are many open-source Java- and Python-based crawlers available that you can customize to your requirements; some of them are listed below.

 Apache Nutch
 StormCrawler
 Jsoup
 Jaunt

In your case, since you need a single value (the price on a product page), you can build your own scraper using JSoup, a framework available in Java, or the Beautiful Soup module in Python.

If scale doesn't matter and you just want to crawl a few pages daily, I recommend building your own crawler; otherwise you can use Nutch or StormCrawler.

Also, for custom extraction, don't write separate selectors for each web page; instead, find a common tag, CSS class, or pattern that will give you the price, as in the sketch below.
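
A minimal sketch of that idea with Jsoup; the URLs and the itemprop selector are hypothetical:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Element;

    public class PriceScraper {
        public static void main(String[] args) throws Exception {
            // Hypothetical product pages sharing one price markup pattern
            String[] urls = { "https://shop.example/p/1", "https://shop.example/p/2" };
            for (String url : urls) {
                // One shared selector instead of per-page selectors
                Element price = Jsoup.connect(url).get().selectFirst("[itemprop=price]");
                System.out.println(url + " -> " + (price != null ? price.text() : "not found"));
            }
        }
    }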

0
Jun 25 '19 at 11:00


