I do simple processing of various documents (ODS, MS Office, PDF) using Apache Tika. At a minimum I need to extract:
word count, author, title, timestamps, language, etc.
This turns out to be not so simple. My strategy uses the Template Method pattern for 6 document types: I first detect the type of the document and then process each type separately.
I know that Apache Tika should eliminate the need for this, but aren't the document formats completely different?
For example:
    InputStream input = this.getClass().getClassLoader().getResourceAsStream(doc);
    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new OfficeParser();
    parser.parse(input, textHandler, metadata, new ParseContext());
    input.close();
    for (String s : metadata.names()) {
        System.out.println("Metadata name : " + s);
    }
I tried this for ODS, MS Office, and PDF documents, and the metadata is very different for each. The MSOffice interface contains metadata lists for MS documents and a Dublin Core metadata list. But how should such an application be implemented?
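One direction I am considering, as a minimal sketch only: copy whatever a parser puts into the Metadata object (via metadata.names() and metadata.get(name)) into a plain map, then resolve each field I care about from a list of candidate keys, so the per-format differences live in one table instead of six subclasses. The class MetadataNormalizer, the canonical field names, and all the candidate key names below are my own assumptions for illustration; the real keys depend on the Tika version and should be taken from the names() output above. The sketch uses only the JDK so it runs without Tika on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: maps format-specific metadata keys
// onto one small set of canonical field names.
class MetadataNormalizer {

    // Canonical field -> candidate raw keys, in order of preference.
    // All key names here are illustrative assumptions.
    private static final Map<String, List<String>> CANDIDATES = new LinkedHashMap<>();
    static {
        CANDIDATES.put("author", List.of("dc:creator", "meta:author", "Author", "creator"));
        CANDIDATES.put("title", List.of("dc:title", "title"));
        CANDIDATES.put("created", List.of("dcterms:created", "meta:creation-date", "Creation-Date"));
        CANDIDATES.put("language", List.of("dc:language", "language"));
    }

    // Returns only the canonical fields for which some candidate key
    // had a non-blank value in the raw metadata.
    static Map<String, String> normalize(Map<String, String> raw) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : CANDIDATES.entrySet()) {
            for (String key : entry.getValue()) {
                String value = raw.get(key);
                if (value != null && !value.isBlank()) {
                    result.put(entry.getKey(), value.trim());
                    break; // first matching candidate wins
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-ins for what two different document formats might report.
        Map<String, String> pdfLike = Map.of("Author", "lisak", "title", "Report");
        Map<String, String> odfLike = Map.of("dc:creator", "lisak", "dc:title", "Budget");
        System.out.println(normalize(pdfLike));
        System.out.println(normalize(odfLike));
    }
}
```

In the real application the raw map would be filled in the loop over metadata.names(), and the parser could be Tika's AutoDetectParser instead of a format-specific one such as OfficeParser, so the type detection would not have to be done by hand. Whether the candidate lists cover all 6 document types is something I would still have to verify against real files.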
Can anyone who has experience working with it share their experience? Thank you.
lisak