How can I use Apache Tika's HTML parser in Java to strip all HTML tags from a page?

I downloaded the tika-core and tika-parsers libraries, but I could not find any code examples for parsing HTML documents. I need to remove all the HTML tags from a web page's source and keep only the text. How can I do this with Apache Tika?

2 answers

Do you need a plain-text version of the HTML file? If so, all you need is something like:

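    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.html.HtmlParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    // Inside a method that declares "throws Exception":
    InputStream input = new FileInputStream("myfile.html");
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    new HtmlParser().parse(input, handler, metadata, new ParseContext());
    String plainText = handler.toString(); // body text with the tags removed
    input.close();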

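A BodyContentHandler created with no constructor arguments, or with only a character limit, captures just the text of the HTML body and returns it from toString(). The no-argument constructor stops at 100,000 characters; pass -1 as the limit to remove the cap.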
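
To make the idea of "capturing only the body text" concrete, here is a rough sketch of the same pattern using only the JDK's SAX parser. It is not Tika's implementation: BodyTextSketch and extractBodyText are made-up names for illustration, and unlike Tika's HtmlParser the JDK parser only accepts well-formed XHTML. The handler ignores everything until it sees a body element, then appends each characters() event to a buffer.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of Apache Tika.
class BodyTextSketch {

    // Returns the character data found inside <body>, with all tags dropped.
    // Assumes the input is well-formed XHTML (the JDK parser rejects tag soup).
    static String extractBodyText(String xhtml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int bodyDepth = 0; // greater than 0 while inside <body>

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (bodyDepth > 0 || "body".equalsIgnoreCase(qName)) {
                    bodyDepth++;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (bodyDepth > 0) {
                    bodyDepth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (bodyDepth > 0) {
                    text.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title></head>"
                + "<body><h1>Hello</h1> <p>plain <b>text</b></p></body></html>";
        System.out.println(extractBodyText(page)); // prints "Hello plain text"
    }
}
```

The title text is skipped because it sits outside the body. With Tika you get the same kind of result from real-world, non-well-formed HTML, which is the reason to use HtmlParser rather than a plain XML parser.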


You can also use Tika's AutoDetectParser, which detects the document type for you and parses any supported format, including HTML. Here is a simple example:

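    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.metadata.TikaCoreProperties;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    // path is the location of the file to parse.
    try (InputStream input = new FileInputStream(path)) {
        ContentHandler textHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        parser.parse(input, textHandler, metadata, context);
        // TikaCoreProperties.TITLE replaces the older Metadata.TITLE constant.
        System.out.println("Title: " + metadata.get(TikaCoreProperties.TITLE));
        System.out.println("Body: " + textHandler.toString());
    } catch (IOException | SAXException | TikaException e) {
        // IOException also covers FileNotFoundException.
        e.printStackTrace();
    }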