Tika 1.1 Performance Improvement

I am using Tika 1.1 and content extraction is slow: it takes about 3 seconds to extract the contents of a 1 MB PDF or DOC file. Is there a way to improve performance? Is there any tuning or configuration that would speed up extraction?

I also tried Tika 1.4, but the same PDF takes about as long: ~3.2 seconds.

I am using BodyContentHandler.

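import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class TikkaExtractor {
    public static void main(String[] args) throws Exception {
        // Write limit of 10000 characters; parse() throws a SAXException once it is exceeded.
        BodyContentHandler handler = new BodyContentHandler(10000);
        Metadata metadata = new Metadata();
        Parser parser = new AutoDetectParser();
        InputStream content = TikkaExtractor.class.getResourceAsStream("demo.pdf");
        try {
            parser.parse(content, handler, metadata, new ParseContext());
        } finally {
            // Close the stream even if parsing fails.
            content.close();
        }
        // The handler itself holds the extracted text, so no decorator is needed.
        String s = handler.toString();
    }
}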
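
Timings like the ones quoted can be taken per run, so that any one-time start-up cost (for example class loading and parser construction) shows up separately from later runs. A minimal sketch of such a harness, using only the JDK and assuming Java 8 or later. The class name ParseTimer, the method timeMillis, and the stand-in workload are hypothetical; in practice the Callable would wrap the parser.parse(...) call from TikkaExtractor, with a fresh handler and stream on each run.

```java
import java.util.concurrent.Callable;

public class ParseTimer {
    /** Runs the task `runs` times and returns each run's wall-clock time in milliseconds. */
    public static long[] timeMillis(Callable<?> task, int runs) throws Exception {
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call();
            elapsed[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return elapsed;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload (hypothetical); replace with a call that parses demo.pdf
        // using a fresh BodyContentHandler and InputStream on each run.
        Callable<Integer> task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
            return sb.length();
        };
        long[] times = timeMillis(task, 5);
        for (int i = 0; i < times.length; i++) {
            System.out.println("run " + (i + 1) + ": " + times[i] + " ms");
        }
    }
}
```

If the first run is much slower than the rest, part of the ~3 seconds may be start-up cost rather than per-document parsing cost; if every run takes about the same time, the cost is more likely in the parsing itself.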
