Apache Tika and document metadata

I do simple processing of various documents (ODS, MS Office, PDF) using Apache Tika. I need to get at least:

word count, author, title, timestamps, language, etc.

which is not so simple. My strategy uses the template method pattern for 6 types of documents: I first detect the type of document and then process each type separately.

I know that Apache Tika is supposed to eliminate the need for this, but the document formats are completely different.

For example:

    InputStream input = this.getClass().getClassLoader().getResourceAsStream(doc);
    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new OfficeParser();
    parser.parse(input, textHandler, metadata, new ParseContext());
    input.close();
    for (String s : metadata.names()) {
        System.out.println("Metadata name : " + s);
    }

I tried this for ODS, MS Office, and PDF documents, and the metadata is very different. The MSOffice interface lists the metadata keys for MS documents, and there is also a Dublin Core metadata list. But how should such an application be implemented?

Can anyone who has experience working with it share their experience? Thank you.

1 answer

As a rule, parsers should return the same metadata key for the same kind of information across all document formats. However, some kinds of metadata exist only in certain file types, so you will not get them from the others.
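If the keys still differ between formats in your Tika version, one way to cope is to ask for a field under several candidate names and take the first one that is present. Below is a minimal sketch of that idea in plain Java. It assumes the metadata has already been copied into a Map (for example by looping over metadata.names() and calling metadata.get(name)). The class name MetadataLookup, the method firstPresent, and the key names used here are only illustrations, so check what metadata.names() actually prints for your files.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika.
class MetadataLookup {

    // Returns the first non-blank value stored under any of the candidate keys,
    // or null if none of them is present.
    static String firstPresent(Map<String, String> metadata, String... candidateKeys) {
        for (String key : candidateKeys) {
            String value = metadata.get(key);
            if (value != null && !value.trim().isEmpty()) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for metadata extracted from one document; key names are assumptions.
        Map<String, String> extracted = new HashMap<>();
        extracted.put("dc:title", "Quarterly report");

        // The title is found under the first candidate key.
        System.out.println(firstPresent(extracted, "dc:title", "title"));
        // This document has no word count under either name, so null is printed.
        System.out.println(firstPresent(extracted, "meta:word-count", "Word-Count"));
    }
}
```

A field that a format simply does not provide comes back as null, so the calling code can decide whether to skip it or compute it some other way.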

You can just use AutoDetectParser, and if you need to do something special with the metadata of a particular format, do it afterwards based on the detected MIME type, for example:

    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
    ParseContext context = new ParseContext();
    Parser parser = new AutoDetectParser();
    // input and textHandler are created the same way as in the question
    parser.parse(input, textHandler, metadata, context);
    // Putting the constant first avoids a NullPointerException if no content type was set
    if ("application/pdf".equals(metadata.get(Metadata.CONTENT_TYPE))) {
        // Do something special with the PDF metadata here
    }
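
With six document types, a chain of if-checks like this gets long, so it may be easier to keep one handler per MIME type in a map. The following is only a sketch in plain Java and does not call Tika itself: the class MimeDispatch, its handlers, and the returned strings are made up for illustration, and the metadata is again assumed to be copied into a Map. The key "Content-Type" is the string behind Metadata.CONTENT_TYPE. The reported content type can carry parameters (for example a charset), so the sketch strips everything after the first semicolon before the lookup.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher, not part of Tika.
class MimeDispatch {

    // One handler per base MIME type; each returns a short description for the demo.
    static final Map<String, Function<Map<String, String>, String>> HANDLERS = new HashMap<>();

    static {
        HANDLERS.put("application/pdf", metadata -> "PDF-specific handling");
        HANDLERS.put("application/vnd.oasis.opendocument.spreadsheet",
                metadata -> "ODS-specific handling");
    }

    // Strips parameters such as "; charset=UTF-8" and normalizes case.
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Picks the handler for the document's type, falling back to generic handling.
    static String handle(Map<String, String> metadata) {
        String type = baseType(metadata.get("Content-Type"));
        return HANDLERS.getOrDefault(type, m -> "generic handling").apply(metadata);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Content-Type", "application/pdf");
        System.out.println(handle(metadata)); // prints "PDF-specific handling"
    }
}
```

The common fields (title, author and so on) can then be read once for every document, and only the format-specific extras go into the handlers.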