Native Java document parser and Linux-based document converter(s)

I am looking for a Java library that can do the following:

Parse emails in *.eml or *.msg format, extract attachments of the types DOC, DOCX, JPEG, PNG, GIF, TXT, XLS, XLSX, PPT and PDF, and convert the attached files to TIFF format.

The library can be either open source or commercial. Alternatively, I'm looking for Linux command-line tools that do this. We have already tried OpenOffice, but it has too many problems with some document formats.

UPDATE:

What I have discovered as a result of research so far:

For parsing email and extracting attachments, JavaMail (http://www.oracle.com/technetwork/java/javamail/index.html) is a good choice.

For converting documents, JodConverter (http://code.google.com/p/jodconverter/) is a convenient library. However, it is only a wrapper around OpenOffice, so if OpenOffice has trouble converting a document (and I often have problems with OpenOffice), JodConverter will have the same trouble.

In conclusion, I have not (so far) been lucky enough to find any document conversion library implemented in native Java that handles all common document formats, neither open source nor commercial. This seems to be a real market gap.

+4
4 answers

RainbowPDF might fit: it's a commercial server-based conversion tool with a Java API.

If you have a Windows server, check out NEEVIA Document Converter Pro. It has some mail features.

Apache POI is an API for reading the contents of Microsoft Office documents. You would have to code the image rendering and composition components yourself. However, it does read the Outlook MSG format.

+2

Apache POI is a Java API for Microsoft documents. However, I do not know how easy it is to convert the parsed document to TIFF.

+1

Maybe a combination of different approaches can be useful? Depending on your requirements, you could use several libraries to convert all the formats you need to manage: Microsoft Office, Adobe PDF, some different image formats and plain text files.

I mean, you can create a process that extracts each attachment (using JavaMail), determines its file format, and then continues processing with the conversion mechanism and library appropriate for that format. Once the format is identified: if the file is an image, try Java Advanced Imaging; if it is a Microsoft Office file, try Apache POI; and so on. For handling PDF files, Apache PDFBox is another good open-source option.
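To make the idea concrete, here is a minimal sketch of such a dispatcher. It is only an illustration under some assumptions: `AttachmentRouter`, `Route`, `routeFor` and `imageToTiff` are names I made up, the format is guessed from the file extension alone, and only the image branch is implemented. For that branch the sketch uses the JDK's own `javax.imageio` rather than Java Advanced Imaging, relying on the TIFF plugin that ships with Java 9 and later. The Office, PDF and text branches are left out on purpose, since they would need POI, PDFBox or something similar.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Locale;
import java.util.Set;
import javax.imageio.ImageIO;

/** Hypothetical router: picks a conversion path for each extracted attachment. */
final class AttachmentRouter {

    /** Conversion paths; only IMAGE is implemented in this sketch. */
    enum Route { IMAGE, OFFICE, PDF, TEXT, UNSUPPORTED }

    private static final Set<String> IMAGE_EXT = Set.of("jpg", "jpeg", "png", "gif");
    private static final Set<String> OFFICE_EXT = Set.of("doc", "docx", "xls", "xlsx", "ppt");

    /** Classifies an attachment by its file name extension (case-insensitive). */
    static Route routeFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Route.UNSUPPORTED; // no extension to go by
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (IMAGE_EXT.contains(ext)) return Route.IMAGE;
        if (OFFICE_EXT.contains(ext)) return Route.OFFICE;
        if (ext.equals("pdf")) return Route.PDF;
        if (ext.equals("txt")) return Route.TEXT;
        return Route.UNSUPPORTED;
    }

    /** Image branch: decodes JPEG/PNG/GIF bytes and re-encodes them as TIFF (needs Java 9+). */
    static byte[] imageToTiff(byte[] imageBytes) throws IOException {
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(imageBytes));
        if (image == null) {
            throw new IOException("No ImageIO reader could decode this data");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(image, "tiff", out)) {
            throw new IOException("No TIFF writer available in this JDK");
        }
        return out.toByteArray();
    }
}
```

In a real pipeline you would probably check the MIME type reported by the mail part as well, because file extensions on attachments are not always reliable.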

By the way, if you are looking for more than just a Java approach, perhaps this thread might help you.

I don't know whether there are better commercial solutions than the ones @ChrisGer mentioned.

0

Do not waste time on Apache POI: it can parse the contents of Office files, but it is not suitable for rendering them.

Since OpenOffice servers are available, I suggest you use one. I also know that you can easily use DCOM to communicate with Microsoft Office applications, so perhaps a Java-to-DCOM bridge is more suitable for this task. However, this is not even recommended by Microsoft (so I believe that JodConverter is equally unstable).

-1
