Is it possible to extract text per page for word / pdf files using Apache Tika?

Question

Is it possible to extract text per page for word / pdf files using Apache Tika?

All the documentation I can find suggests that I can only extract the text of the entire file, but I need to retrieve the pages individually. Do I need to write my own parser for this, or is there some obvious method that I am missing?

+8

text apache-tika

Asif sheikh Apr 28 '11 at 20:53


3 answers

You will need to work with the underlying libraries - Tika itself does nothing at the page level.

For PDF files, PDFBox should be able to give you some page-level information. For Word, neither HWPF nor XWPF from Apache POI really works at the page level - page breaks are not stored in the file, but have to be calculated "on the fly" from the text + fonts + page size ...

+5

Gagravarr Apr 29 '11 at 1:58


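You can get the number of pages in a PDF from the metadata key xmpTPg:NPages, as shown below:

    // fis is an InputStream opened on the PDF file
    ContentHandler handler = new BodyContentHandler();
    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    parser.parse(fis, handler, metadata, parseContext);
    // The page count is returned as a String
    String pageCount = metadata.get("xmpTPg:NPages");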

+5

hd1 Jul 24 '13 at 21:22


topchef · Accepted Answer · 2011-06-07T21:09:30+0000

In fact, Tika does handle pages (at least for PDF): it sends <div><p> elements before a page starts and </p></div> after the page ends. This relies on the parser emitting exactly one <p> per page; if your Tika version also emits <p> for individual paragraphs, match the page <div> instead. You can easily set up a page counter in your handler using this (the example counts pages using only <p>):

    public abstract class MyContentHandler implements ContentHandler {

        private String pageTag = "p";
        protected int pageNumber = 0;

        ...

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
            if (pageTag.equals(qName)) {
                startPage();
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (pageTag.equals(qName)) {
                endPage();
            }
        }

        protected void startPage() throws SAXException {
            pageNumber++;
        }

        protected void endPage() throws SAXException {
            return;
        }

        ...
    }

When doing this with PDF, you may run into a problem where the parser does not send the text lines in the correct order - see "Extracting text from PDF files using Apache Tika 0.9 (and PDFBox under the hood)" for how to deal with this.
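The page-counting idea can be taken one step further to collect the text of each page separately. What follows is only a minimal sketch, not Tika's own API: the class name PageTextHandler and the helper splitPages are hypothetical, and it assumes the parser wraps every page in a <div class="page"> element, which is worth verifying against the XHTML your Tika version actually produces. It uses only the SAX classes that ship with the JDK, so the demo helper feeds it a hand-written XHTML string instead of real Tika output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text of each <div class="page"> into its own string. */
class PageTextHandler extends DefaultHandler {

    private final List<String> pages = new ArrayList<>();
    private StringBuilder current;  // null while we are outside a page div
    private int divDepth = 0;       // <div> nesting depth inside the current page

    /** SAX reports an empty local name when namespaces are off, so fall back to qName. */
    private static String nameOf(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(nameOf(localName, qName))) {
            return;
        }
        if (current == null) {
            // Assumption: a page starts at <div class="page">
            if ("page".equals(atts.getValue("class"))) {
                current = new StringBuilder();
                divDepth = 1;
            }
        } else {
            divDepth++; // nested div inside a page
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return;
        }
        String name = nameOf(localName, qName);
        if ("p".equals(name)) {
            current.append('\n'); // keep paragraphs on separate lines
        }
        if ("div".equals(name) && --divDepth == 0) {
            pages.add(current.toString().trim()); // the page div is closed
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    List<String> getPages() {
        return pages;
    }

    /** Demo helper: runs the handler over an XHTML string with the JDK SAX parser. */
    static List<String> splitPages(String xhtml) {
        try {
            PageTextHandler handler = new PageTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getPages();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
    }
}
```

With Tika, an instance of this handler would be passed as the handler argument of parser.parse(stream, handler, metadata, parseContext), and getPages() read once parsing has finished. Matching on the page <div> rather than on <p> keeps the page boundaries correct even if the parser emits one <p> per paragraph.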
