Apache Tika: Parsing a text file omits the last part?

Question

I am trying to parse a text file using Tika, but I am seeing unexpected behavior.

In particular, I defined a simple handler as follows:

    public class MyHandler extends DefaultHandler {
        @Override
        public void characters(char ch[], int start, int length) throws SAXException {
            System.out.println(new String(ch));
        }
    }

Then I parse the file ("myfile.txt") as follows:

    Tika tika = new Tika();
    InputStream is = new FileInputStream("myfile.txt");
    Metadata metadata = new Metadata();
    ContentHandler handler = new MyHandler();
    Parser parser = new TXTParser();
    ParseContext context = new ParseContext();
    String mimeType = tika.detect(is);
    metadata.set(HttpHeaders.CONTENT_TYPE, mimeType);
    parser.parse(is, handler, metadata, context);

I expect all of the text in the file to be printed on the screen, but a small part at the end is missing. More specifically, the characters() callback keeps receiving 4,096 characters per call, but at the end it apparently omits the last 5,083 characters of the file (which is a few MB long), so more than just the final callback is missing.

In addition, when I test with another, smaller file of about 5,000 characters, the callback is never invoked at all!

The MIME type is correctly detected as text/plain in both cases.

Any ideas?

Thanks!

java apache apache-tika

PNS Jul 07 '11 at 20:25

1 answer

Johan sjöberg · Accepted Answer · 2011-07-07T20:49:45+0000

What version of Tika are you using? Looking at the source code, TXTParser reads the input in chunks of 4096 characters, as can be seen on line 129 of TXTParser. Line 132 calls the characters(...) method.

In short, the relevant code is:

    char[] buffer = new char[4096];
    int n = reader.read(buffer);
    while (n != -1) {
        xhtml.characters(buffer, 0, n);
        n = reader.read(buffer);
    }

where reader is a BufferedReader. I see no flaws in this code, so perhaps you are working with an older version?
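One detail of the loop above is worth illustrating: the same buffer is reused for every call, and only the first n characters are valid each time. The following is a minimal sketch using only the JDK (no Tika), so the class names, the feed helper, and the small buffer size are assumptions made for illustration rather than Tika's actual code. It compares a handler that honors start and length with one that, like the handler in the question, converts the whole array.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class CharactersDemo {

    /** Collects only the valid slice ch[start .. start + length) of each callback. */
    static class SliceHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            text.append(ch, start, length);
        }
    }

    /** Mirrors the question's handler: converts the whole array, ignoring start and length. */
    static class WholeBufferHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            text.append(new String(ch));
        }
    }

    /** Hypothetical stand-in for the parser loop quoted above, with a configurable buffer size. */
    static void feed(Reader source, ContentHandler handler, int bufferSize)
            throws IOException, SAXException {
        try (BufferedReader reader = new BufferedReader(source)) {
            char[] buffer = new char[bufferSize];
            int n = reader.read(buffer);
            while (n != -1) {
                // Only buffer[0 .. n) holds fresh data; the rest is left over from earlier reads.
                handler.characters(buffer, 0, n);
                n = reader.read(buffer);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String input = "abcdefghij"; // 10 characters with a buffer of 4: chunks of 4, 4 and 2

        SliceHandler slice = new SliceHandler();
        feed(new StringReader(input), slice, 4);
        System.out.println(slice.text); // prints abcdefghij

        WholeBufferHandler whole = new WholeBufferHandler();
        feed(new StringReader(input), whole, 4);
        System.out.println(whole.text); // prints abcdefghijgh (stale "gh" from the previous chunk)
    }
}
```

This does not by itself account for text going missing, but it suggests that a handler should use new String(ch, start, length) rather than new String(ch), since otherwise a short final chunk is printed together with leftover characters from the previous read.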
