HTML parsing in Java

Question

HTML parsing in Java

I am working on an application that scrapes data from a website, and I was wondering how I should go about collecting the data. Specifically, I need the data contained in a number of div tags that use a specific CSS class. Currently (for testing purposes) I'm just checking for

    <div class="classname">

in each line of HTML. This works, but I can't help feeling that there is a better solution.

Is there a good way to hand a class a string of HTML and get back some useful methods, for example:

    boolean usesClass(String CSSClassname);
    String getText();
    String getLink();

+50

java html parsing web-scraping

Richard Walton Oct 26 '08 at 13:57


11 answers

Another library that may be useful for processing HTML is jsoup. jsoup tries to clean up malformed HTML and lets you parse HTML in Java using a jQuery-like selector syntax.

http://jsoup.org/
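As a rough sketch of how this maps onto the `usesClass` / `getText` / `getLink` methods from the question (the class name `JsoupDivExample`, the helper names, the sample HTML and the `http://example.com/` base URI are all made up for illustration, and jsoup has to be on the classpath):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDivExample {

    /** Text of the first div carrying the given CSS class, or null if there is none. */
    public static String textOf(String html, String cssClass) {
        Document doc = Jsoup.parse(html);
        Element div = doc.select("div." + cssClass).first();
        return div == null ? null : div.text();
    }

    /** Absolute URL of the first link inside the first matching div, or null. */
    public static String linkOf(String html, String baseUri, String cssClass) {
        // The base URI is only used to resolve relative href values.
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("div." + cssClass + " a[href]").first();
        return link == null ? null : link.absUrl("href");
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: unquoted attribute, unclosed last div.
        String html = "<div class=classname>Hello <a href='/page'>link</a></div>"
                + "<div class=\"other\">skipped";
        System.out.println(textOf(html, "classname"));
        System.out.println(linkOf(html, "http://example.com/", "classname"));
    }
}
```

The selector `div.classname` matches the class as one token of the `class` attribute, so a div with `class="classname extra"` is found as well.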

+58

rajsite May 18 '11 at 9:33


The main problem, as the previous answers point out, is malformed HTML, so you need an HTML cleaner or an HTML-to-XML converter. Once you have XML (XHTML), there are plenty of tools for processing it. You can use a simple SAX handler that extracts only the data you need, or any tree-based approach (DOM, JDOM, etc.) that also lets you modify the original document.
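For the tree-based route, once the markup really is well-formed, the JDK's own DOM and XPath classes are enough. This is only a minimal sketch: the class name `DivXPathExample`, the method `textOfDivs` and the sample markup are invented for illustration, and the input is assumed to be clean XHTML without a DOCTYPE (a DOCTYPE may make the parser try to fetch the DTD).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DivXPathExample {

    /** Returns the trimmed text of every div whose class attribute contains cssClass. */
    public static List<String> textOfDivs(String xhtml, String cssClass) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // Match the class as a whole token of the space-separated class attribute.
        // cssClass is assumed to be a plain class name without quotes.
        String expr = "//div[contains(concat(' ', normalize-space(@class), ' '), ' "
                + cssClass + " ')]";
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList divs = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);

        List<String> texts = new ArrayList<>();
        for (int i = 0; i < divs.getLength(); i++) {
            texts.add(divs.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div class=\"tags first\">alpha <b>beta</b></div>"
                + "<div class=\"other\">ignored</div>"
                + "<div class=\"tags\">gamma</div>"
                + "</body></html>";
        System.out.println(textOfDivs(xhtml, "tags")); // prints [alpha beta, gamma]
    }
}
```

Because the result is a DOM tree, the matched nodes could also be modified and serialized again, which a SAX handler cannot do.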

Here is a code example that uses HtmlCleaner to get all DIVs that use a particular class and print all the text inside them.

    import java.io.IOException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;

    /**
     * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
     */
    public class TestHtmlParse {
        static final String className = "tags";
        static final String url = "http://www.stackoverflow.com";

        TagNode rootNode;

        public TestHtmlParse(URL htmlPage) throws IOException {
            // Download the page and clean it into a tree of TagNodes
            HtmlCleaner cleaner = new HtmlCleaner();
            rootNode = cleaner.clean(htmlPage);
        }

        List<TagNode> getDivsByClass(String CSSClassname) {
            List<TagNode> divList = new ArrayList<TagNode>();

            // true = search the whole tree recursively
            TagNode[] divElements = rootNode.getElementsByName("div", true);
            for (int i = 0; divElements != null && i < divElements.length; i++) {
                String classType = divElements[i].getAttributeByName("class");
                // Exact comparison of the whole class attribute value
                if (classType != null && classType.equals(CSSClassname)) {
                    divList.add(divElements[i]);
                }
            }

            return divList;
        }

        public static void main(String[] args) {
            try {
                TestHtmlParse thp = new TestHtmlParse(new URL(url));

                List<TagNode> divs = thp.getDivsByClass(className);
                System.out.println("*** Text of DIVs with class '" + className + "' at '" + url + "' ***");
                for (TagNode divElement : divs) {
                    System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
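One caveat with the example above: it compares the whole `class` attribute to the class name, so a div with `class="tags featured"` would not match. If that matters, a small helper along these lines could replace the `equals` check. This is a sketch only; the class name `CssClassMatch` is invented, and `usesClass` just borrows the name from the question.

```java
import java.util.Arrays;

public class CssClassMatch {

    /** Returns true if the class attribute value contains cssClass as a whole token. */
    public static boolean usesClass(String classAttribute, String cssClass) {
        if (classAttribute == null || cssClass == null || cssClass.isEmpty()) {
            return false;
        }
        // The class attribute is a whitespace-separated list of class names.
        return Arrays.asList(classAttribute.trim().split("\\s+")).contains(cssClass);
    }

    public static void main(String[] args) {
        System.out.println(usesClass("tags featured", "tags")); // prints true
        System.out.println(usesClass("tagsy", "tags"));         // prints false
    }
}
```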

+20

Fernando Miguélez Oct 26 '08 at 14:55


You might be interested in TagSoup, a Java parser capable of handling malformed HTML. XML parsers will only work on well-formed XHTML.

+13

PhiLho Oct 26 '08 at 14:16


Maybe try the HTMLParser project (http://htmlparser.sourceforge.net/). It seems to be pretty decent at handling malformed HTML. The following snippet should do what you need:

    Parser parser = new Parser(htmlInput);
    CssSelectorNodeFilter cssFilter = new CssSelectorNodeFilter("DIV.targetClassName");
    NodeList nodes = parser.parse(cssFilter);

+5

dave Oct 26 '08 at 14:23


Jericho: http://jericho.htmlparser.net/docs/index.html

Easy to use, supports poorly formed HTML, many examples.

+5

FolksLord Jan 21 '11 at 18:36


HtmlUnit might help. It does a lot more than parsing, too.

http://htmlunit.sourceforge.net/

+4

alex Oct 26 '08 at 19:16


Don't forget Jerry, which is jQuery in Java: a fast and concise Java library that makes it easy to parse, traverse, and manipulate HTML documents, including the use of CSS3 selectors.

Example:

    Jerry doc = jerry(html);
    doc.$("div#jodd p.neat").css("color", "red").addClass("ohmy");

Example:

    doc.form("#myform", new JerryFormHandler() {
        public void onForm(Jerry form, Map<String, String[]> parameters) {
            // process form and parameters
        }
    });

Of course, these are just a few quick examples to give a feel for how it all looks.

+4

igr Jan 08 '12 at 17:37


The nu.validator project is an excellent, high-performance HTML parser that does not cut corners on correctness.

The Validator.nu HTML Parser is an implementation of the HTML5 parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM, or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)

+3

Mike Samuel Aug 19 '11 at 0:13


You can also use the XWiki HTML Cleaner:

It uses HtmlCleaner and extends it to generate valid XHTML 1.1 content.

+1

Vincent Massol 04 Oct 2018-11-15T00:


If your HTML is well-formed, you can easily use an XML parser to do the job for you. If you are only reading, SAX would be ideal.
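A minimal sketch of the SAX approach using only the JDK, assuming the input really is well-formed. The names `DivTextSaxExample`, `DivTextHandler` and `textOfDivs`, and the sample markup, are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DivTextSaxExample {

    /** Collects the text of every outermost div whose class attribute contains cssClass. */
    static class DivTextHandler extends DefaultHandler {
        private final String cssClass;
        private final List<String> texts = new ArrayList<>();
        private StringBuilder current; // non-null while inside a matching div
        private int depth;             // div nesting depth inside the matching div

        DivTextHandler(String cssClass) {
            this.cssClass = cssClass;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (!"div".equals(qName)) {
                return;
            }
            if (current != null) {
                depth++; // nested div inside a div we are already collecting
                return;
            }
            String classAttr = attrs.getValue("class");
            if (classAttr != null
                    && Arrays.asList(classAttr.trim().split("\\s+")).contains(cssClass)) {
                current = new StringBuilder();
                depth = 1;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current != null && "div".equals(qName) && --depth == 0) {
                texts.add(current.toString().trim());
                current = null;
            }
        }

        List<String> getTexts() {
            return texts;
        }
    }

    public static List<String> textOfDivs(String xhtml, String cssClass) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DivTextHandler handler = new DivTextHandler(cssClass);
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getTexts();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div class=\"tags first\">alpha <b>beta</b></div>"
                + "<div class=\"other\">ignored</div>"
                + "<div class=\"tags\">gamma</div>"
                + "</body></html>";
        System.out.println(textOfDivs(xhtml, "tags")); // prints [alpha beta, gamma]
    }
}
```

On real-world pages the parser will throw a `SAXParseException` at the first unclosed tag, which is why the other answers suggest cleaning the HTML first.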

0

Yuval Oct 26 '08 at 14:01


user31586 · Accepted Answer · 2008-10-26 16:06

A few years ago, I used JTidy for the same purpose:

http://jtidy.sourceforge.net/

"JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.

More information on JTidy can be found on the JTidy SourceForge project page."
