I am trying to parse a text file using Tika, but I am seeing strange behavior.
In particular, I defined a simple handler as follows:
    public class MyHandler extends DefaultHandler {
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            System.out.println(new String(ch));
        }
    }
Then I parse the file ("myfile.txt") as follows:
    Tika tika = new Tika();
    InputStream is = new FileInputStream("myfile.txt");
    Metadata metadata = new Metadata();
    ContentHandler handler = new MyHandler();
    Parser parser = new TXTParser();
    ParseContext context = new ParseContext();
    String mimeType = tika.detect(is);
    metadata.set(HttpHeaders.CONTENT_TYPE, mimeType);
    parser.parse(is, handler, metadata, context);
I expected all the text in the file to be printed to the screen, but a small part at the end is missing. More specifically, the characters() callback keeps receiving 4,096 characters per call, but in the end it apparently leaves out the last 5,083 characters of the file (which is a few MB long), so more than just the final callback is missing.
In addition, when I test with another, smaller file of about 5,000 characters, no callback occurs at all!
The MIME type is correctly detected as text/plain in both cases.
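One thing I am aware of: per the SAX contract, only the range ch[start .. start+length) is valid in each characters() call, so printing new String(ch) can show stale or extra buffer content. Below is a minimal sketch of a handler that honors that range. The names SliceHandler and text() are made up for illustration, and it uses only the JDK's SAX classes rather than Tika, so it may or may not be related to the missing text.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variant of MyHandler: collects only the valid slice of each callback.
class SliceHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only ch[start .. start+length) belongs to this event; the rest of
        // the array may contain leftover data from earlier reads.
        text.append(ch, start, length);
    }

    // Returns everything collected so far.
    String text() {
        return text.toString();
    }

    public static void main(String[] args) {
        SliceHandler handler = new SliceHandler();
        char[] buffer = "..hello world..".toCharArray();
        // Simulate a parser reporting 11 valid characters starting at index 2.
        handler.characters(buffer, 2, 11);
        System.out.println(handler.text());
    }
}
```

With a handler like this, the total number of characters received can be compared against the file length to see exactly how much is missing.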
Any ideas?
Thanks!