Extract microdata from HTML in Java

I need help extracting Microdata embedded in HTML5. My goal is to get structured data from a web page, just like this Google tool does: http://www.google.com/webmasters/tools/richsnippets . I have searched a lot but have not found a solution.

I am currently using the any23 library, but I cannot find any documentation apart from the javadocs, which do not give me enough information.

I am using any23's MicrodataExtractor, but I am stuck on the third parameter: "org.w3c.dom.Document in". I cannot get the HTML content parsed into a W3C DOM. I tried JTidy as well as JSoup, but the DOM objects from these libraries do not fit what the extractor expects. I am also unsure about the second parameter of the MicrodataExtractor.

I hope someone can help me do this with any23, or suggest another library that solves this extraction problem.

Edit: I found a solution myself, using the same approach as the any23 command-line tool. Here is the code snippet:

    // Fetch the page over HTTP; "value" is the URL of the page to analyze
    HTTPDocumentSource doc = new HTTPDocumentSource(DefaultHTTPClient.createInitializedHTTPClient(), value);
    InputStream documentInputInputStream = doc.openInputStream();

    // Parse the (possibly malformed) HTML into a W3C DOM
    TagSoupParser tagSoupParser = new TagSoupParser(documentInputInputStream, doc.getDocumentURI());
    Document document = tagSoupParser.getDOM();

    // Extract the microdata and capture the JSON output as a string
    ByteArrayOutputStream byteArrayOutput = new ByteArrayOutputStream();
    MicrodataParser.getMicrodataAsJSON(document, new PrintStream(byteArrayOutput));
    String result = byteArrayOutput.toString("UTF-8");

This code extracts only the microdata from the HTML and writes it out as JSON. I also tried MicrodataExtractor, which can produce other output formats (RDF, Turtle, ...), but its input document seems to accept only XML. It throws "Document does not start" when I pass in an HTML document.

If anyone finds a way to use MicrodataExtractor, please leave an answer here. Thanks.

java extraction microdata
1 answer

XPath is usually the way to query HTML or XML.

Take a look at: How to read XML using XPath in Java
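
To make the XPath suggestion concrete, here is a minimal sketch that uses only the JDK's built-in javax.xml.xpath API. It assumes the input is already well-formed XML (XHTML), which real-world HTML often is not. The class name ItempropXPath and the method extractItemprops are made up for this illustration. It is also a simplification of Microdata: it takes the text content of every element that has an itemprop attribute, and it ignores nested itemscope items and the rule that some elements (a, img, meta, ...) take their value from an attribute instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of any23.
public class ItempropXPath {

    // Returns one "itemprop=text" entry per element that carries an itemprop attribute.
    public static List<String> extractItemprops(String xhtml) throws Exception {
        // Parse the well-formed markup into a W3C DOM (not namespace-aware by default).
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // Select every element, at any depth, that has an itemprop attribute.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("//*[@itemprop]", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            // Simplification: always use the element's text content as the value.
            result.add(element.getAttribute("itemprop") + "=" + element.getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\">"
                + "<span itemprop=\"name\">Jane Doe</span>"
                + "<span itemprop=\"jobTitle\">Professor</span>"
                + "</div></body></html>";
        for (String property : extractItemprops(xhtml)) {
            System.out.println(property);
        }
    }
}
```

For pages that are not well-formed, the markup would first have to go through a lenient HTML parser, as in the question's snippet, before a query like this can be applied.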
