In fact, Tika does handle pages (at least for PDF input): it sends the <div><p> elements before the start of each page and </p></div> after the end of each page. You can build a page counter in your own handler on top of this, for example by counting only the <p> elements:
    import org.xml.sax.Attributes;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public abstract class MyContentHandler implements ContentHandler {

        // Element that marks a page boundary in the SAX events
        private String pageTag = "p";
        // Number of the page currently being processed (0 before the first page)
        protected int pageNumber = 0;

        ...

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
            if (pageTag.equals(qName)) {
                startPage();
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (pageTag.equals(qName)) {
                endPage();
            }
        }

        protected void startPage() throws SAXException {
            pageNumber++;
        }

        protected void endPage() throws SAXException {
            // Nothing to do by default; override to act when a page ends
        }

        ...
    }
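To try the counting idea without a PDF at hand, here is a minimal, self-contained sketch that uses only the JDK's SAX parser. The class name PageCountSketch, the method countPages, and the sample XHTML string are all made up for illustration; the sample is a hand-written stand-in for the events described above, not real Tika output.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PageCountSketch {

    /** Counts start tags named pageTag, assuming one such tag per page. */
    static class PageCounter extends DefaultHandler {
        private final String pageTag;
        int pageNumber = 0;

        PageCounter(String pageTag) {
            this.pageTag = pageTag;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (pageTag.equals(qName)) {
                pageNumber++; // a new page starts here
            }
        }
    }

    /** Parses the given XHTML string and returns how many pageTag elements it opened. */
    public static int countPages(String xhtml, String pageTag) throws Exception {
        PageCounter handler = new PageCounter(pageTag);
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.pageNumber;
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample (an assumption): one <div><p>...</p></div> per page
        String sample = "<html><body>"
                + "<div><p>first page text</p></div>"
                + "<div><p>second page text</p></div>"
                + "</body></html>";
        System.out.println(countPages(sample, "p"));
    }
}
```

The tag name is a parameter so that you can switch to the enclosing div if the Tika version you use turns out to emit more than one <p> per page; check the actual output of your version before relying on either.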
When doing this with PDF files, you may run into a problem where the parser does not send the text strings in the correct order. See "Extracting text from PDF files using Apache Tika 0.9 (and PDFBox under the hood)" for how to deal with this.
topchef