How can I use the HTML parser with Apache Tika in Java to extract all HTML tags?

Question


I downloaded the tika-core and tika-parsers libraries, but I could not find any code examples for parsing HTML documents. I need to strip all the HTML tags from a web page's source and keep only the text. How can I do this with Apache Tika?

+7

java html apache apache-tika

lkalay Mar 25 '11 at 7:47


2 answers

You can also use Tika's AutoDetectParser, which detects the file type and parses any supported format, including HTML. Here is a simple example:

    // try-with-resources closes the stream even if parsing fails
    try (InputStream input = new FileInputStream(new File(path))) {
        ContentHandler textHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();

        // The parser detects the document type and fills both the handler and the metadata
        parser.parse(input, textHandler, metadata, context);

        // TITLE is a static constant, so access it through the class, not the instance
        System.out.println("Title: " + metadata.get(Metadata.TITLE));
        System.out.println("Body: " + textHandler.toString());
    } catch (IOException | SAXException | TikaException e) {
        // FileNotFoundException is a subclass of IOException, so it is covered here
        e.printStackTrace();
    }

+1

UserNeD Aug 12 '14 at 10:51


Gagravarr · Accepted Answer · 2011-04-02T10:15:39+0000

Do you just need a plain-text version of the HTML file? If so, all you need is something like:

    InputStream input = new FileInputStream("myfile.html");
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    new HtmlParser().parse(input, handler, metadata, new ParseContext());
    String plainText = handler.toString();

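A BodyContentHandler created with no constructor arguments, or with only a character limit, captures just the text of the HTML body (no tags) and returns it from toString(). The no-argument constructor stops at 100,000 characters and throws a SAXException once that limit is reached; pass -1 as the limit to disable it for large pages.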
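
To see why the handler returns only body text: Tika parsers report the document to the ContentHandler as XHTML SAX events, and a body handler keeps only the character events that arrive inside the body element. The sketch below illustrates that idea with the JDK's own SAX parser on a well-formed XHTML string. It is not Tika's implementation; the class names BodyTextSketch and BodyTextHandler and the method bodyText are made up for this illustration, and unlike Tika it adds no whitespace between block elements.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextSketch {

    /** Hypothetical handler: keeps character data only while inside <body>. */
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // greater than 0 while the parser is inside <body>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Start counting at <body>, then track nesting of every element below it
            if (bodyDepth > 0 || "body".equalsIgnoreCase(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Tags never reach this method; only text between them does
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Parses a well-formed XHTML string and returns the text found inside <body>. */
    static String bodyText(String xhtml) throws Exception {
        BodyTextHandler handler = new BodyTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>Demo</title></head>"
                + "<body><h1>Hello</h1><p>Tika <b>strips</b> tags.</p></body></html>";
        System.out.println(bodyText(page));
    }
}
```

The title text is skipped because it arrives before the body element opens. Real-world HTML is rarely well-formed XML, which is why you would still use Tika's HtmlParser rather than a plain SAX parser for actual pages.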
