Determine whether a document is DOC or DOCX in a Java application without knowing its extension

Our content management system requires that all text documents be stored with one specific extension (neither DOC nor DOCX). However, when a document is served to a user, we need to know whether it is a DOC or a DOCX file so that the correct MIME type can be sent.

So, is there a way to find out programmatically, from its contents alone, whether a document is DOC or DOCX?

+6
java doc docx
2 answers

Here is a link to the ForensicsWiki, which describes many different types of files. It describes the headers of DOC and DOCX files, so you should be able to parse the files and determine what they are.

According to that page, .doc files are OLE Compound Files, so the file should begin with the following binary header:

d0 cf 11 e0 a1 b1 1a e1 

In contrast, .docx files begin with this binary signature:

 50 4b 
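
As a rough illustration of the check described above, here is a minimal Java sketch that compares the first bytes of a file against the two signatures and returns a MIME type guess. The class and method names (WordFormatSniffer, guessMimeType) are hypothetical, and the sketch assumes Java 9 or later for InputStream.readNBytes. Note that a match on 50 4b only shows that the file is a ZIP archive, so treating it as DOCX is an assumption that holds only if the system stores nothing else in ZIP format.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper; the class and method names are illustrative only.
class WordFormatSniffer {
    // OLE Compound File signature found at the start of legacy .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // "PK": the first two bytes of a ZIP archive, and therefore of a .docx file.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B };

    static final String DOC_MIME = "application/msword";
    static final String DOCX_MIME =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    // Returns a MIME type guess from the leading bytes, or null if neither signature matches.
    static String guessMimeType(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return DOC_MIME;
        // Assumption: any ZIP container in this system is a DOCX file.
        if (startsWith(header, ZIP_MAGIC)) return DOCX_MIME;
        return null;
    }

    // Reads at most eight bytes from the file and classifies them.
    static String guessMimeType(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = new byte[OLE2_MAGIC.length];
            int read = in.readNBytes(header, 0, header.length); // Java 9+
            return guessMimeType(Arrays.copyOf(header, read));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A caller would pass the stored file's path to guessMimeType and fall back to a generic type such as application/octet-stream when the result is null.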
+10

DOCX files are in ZIP format, in which the first two bytes are the letters PK (after the creator of ZIP, Phil Katz).
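
Because every ZIP archive starts with PK, the two-byte test cannot tell a DOCX file from any other ZIP-based file. One possible way to narrow it down, sketched here under the assumption that the main document part sits at the conventional path word/document.xml, is to scan the archive entries with java.util.zip. The class and method names (DocxPackageCheck, looksLikeDocx) are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper; the class and method names are illustrative only.
class DocxPackageCheck {
    // Assumption: a Word document stores its main part under this conventional path.
    private static final String MAIN_PART = "word/document.xml";

    // Returns true if the stream is a ZIP archive containing word/document.xml.
    static boolean looksLikeDocx(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() returns null at the end of the archive,
            // and also when the input is not a ZIP archive at all.
            while ((entry = zip.getNextEntry()) != null) {
                if (MAIN_PART.equals(entry.getName())) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

This reads entry headers sequentially, so it costs more than a two-byte comparison; it is probably only worth running after the PK check has already matched.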

+9
