Best way to determine if a *.doc file is RTF with Java or ColdFusion

Question


So, I have about 4,000 Word documents that I am trying to extract the text from and insert into a db table. This works smoothly until the processor encounters a document that has a *.doc file extension but is actually RTF. Now, I know that POI does not support RTF, which is fine, but I need a way to determine whether a *.doc file is actually RTF so that I can ignore the file and continue processing.

I have tried several methods to overcome this, including using ColdFusion's MimeTypeUtils; however, it seems to base its assumption of the MIME type on the file extension and still classifies the RTF as application/msword. Is there any other way to determine whether a *.doc file is RTF? Any help would be greatly appreciated.

+4

java coldfusion mime-types apache-poi

Anne Porosoff Apr 26 '09 at 0:18


4 answers

The first five bytes in any RTF file should be:

 {\rtf

If it is not, this is not an RTF file.
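In Java, that check might look something like the minimal sketch below. The class and method names are made up for illustration, and `readNBytes` assumes Java 11 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Illustrative helper; the name RtfSniffer is not from any library.
final class RtfSniffer {
    // The five ASCII bytes an RTF file starts with: {\rtf
    private static final byte[] RTF_MAGIC = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the file content starts with the RTF marker, whatever its extension says. */
    static boolean isRtf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            // Read at most five bytes; a shorter file simply yields a shorter array.
            byte[] head = in.readNBytes(RTF_MAGIC.length);
            return Arrays.equals(head, RTF_MAGIC);
        }
    }
}
```

Only the first five bytes are read, so this stays cheap even across a few thousand files.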

The external links section of the Wikipedia article points to the specifications for the various versions of RTF.

Doc files (at least those since Word 97) use something called the "Windows Compound Binary Format", documented in a PDF here. According to that document, these Doc files begin with the following sequence:

 0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1

Or in older beta versions:

 0x0e, 0x11, 0xfc, 0x0d, 0xd0, 0xcf, 0x11, 0xe0

According to the Wikipedia article on Word, there were at least 5 different formats before “97”.

Looking for {\rtf should be your best bet.

Good luck, hope this helps.

+7

MBCook Apr 26 '09 at 0:45


You can convert the byte array to a string:

 <cfset str = createObject("java", "java.lang.String").init(bytes)>

You could also try the hasxxxHeader methods from the POI source. They determine whether the input file is something POI can handle: OLE or OOXML. But I believe someone else suggested using a simple try/catch to skip the problem files. Is there a reason you do not want to do that? It would seem the simpler option.
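A rough Java sketch of the try/catch approach is below. The `Extractor` interface is a hypothetical stand-in for whichever POI call you use to pull the text out of one file; it is not part of POI, and the other names are made up as well.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; none of these names come from POI.
final class SkipBadFiles {
    /** Hypothetical stand-in for the call that extracts the text of one document. */
    interface Extractor {
        String extract(String path) throws Exception;
    }

    /**
     * Runs the extractor over every path. A path whose extraction throws is
     * added to 'skipped' and processing continues with the next one.
     */
    static List<String> extractAll(List<String> paths, Extractor extractor, List<String> skipped) {
        List<String> texts = new ArrayList<>();
        for (String path : paths) {
            try {
                texts.add(extractor.extract(path));
            } catch (Exception e) {
                // Covers the case where the extractor rejects a mislabelled file.
                skipped.add(path);
            }
        }
        return texts;
    }
}
```

Keeping the skipped paths in a list lets you review the rejected files afterwards instead of losing track of them.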

Update: Peter's suggestion of using the CF8 function would also work:

 <cfset input = FileOpen(pathToYourFile)>
 <cfset bytes = FileRead(input , 8)>
 <cfdump var="#bytes#">
 <cfset FileClose(input)>

+1

Leigh Apr 27 '09 at 16:31


You could try to identify the files with the DROID (Digital Record Object Identification) tool, which provides access to the PRONOM technical registry.

0

Fabian Steeg Apr 26 '09 at 1:35


Peter Boughton · Accepted Answer · 2009-04-26T02:53:49+0000

With CF8 and compatible:

 <cffunction name="IsRtfFile" returntype="Boolean" output="false">
     <cfargument name="FileName" type="String" />
     <cfreturn Left(FileRead(Arguments.FileName),5) EQ '{\rtf' />
 </cffunction>

For earlier versions:

 <cffunction name="IsRtfFile" returntype="Boolean" output="false">
     <cfargument name="FileName" type="String" />
     <cfset var FileData = 0 />
     <cffile variable="FileData" action="read" file="#Arguments.FileName#" />
     <cfreturn Left(FileData,5) EQ '{\rtf' />
 </cffunction>

Update: A better CF8/compatible answer. To avoid loading the entire file into memory, you can do the following to load only the first few characters:

 <cffunction name="IsRtfFile" returntype="Boolean" output="false">
     <cfargument name="FileName" type="String" />
     <cfset var FileData = 0 />
     <cfloop index="FileData" file="#Arguments.FileName#" characters="5">
         <cfbreak/>
     </cfloop>
     <cfreturn FileData EQ '{\rtf' />
 </cffunction>

Based on the comments:
Here's a very quick way to put together a "what format is this" type of function. Not perfect, but it gives you the idea...

 <cffunction name="determineFileFormat" returntype="String" output="false"
     hint="Determines format of file based on header of the file data."
     >
     <cfargument name="FileName" type="String"/>
     <cfset var FileData = 0 />
     <cfset var CurFormat = 0 />
     <cfset var MaxBytes = 8 />
     <cfset var Formats =
         { WordNew  : 'D0,CF,11,E0,A1,B1,1A,E1'
         , WordBeta : '0E,11,FC,0D,D0,CF,11,E0'
         , Rtf      : '7B,5C,72,74,66' <!--- {\rtf --->
         , Jpeg     : 'FF,D8'
         }/>

     <cfloop index="FileData" file="#Arguments.FileName#" characters="#MaxBytes#">
         <cfbreak/>
     </cfloop>

     <cfloop item="CurFormat" collection="#Formats#">
         <cfif Left( FileData , ListLen(Formats[CurFormat]) ) EQ convertToText(Formats[CurFormat]) >
             <cfreturn CurFormat />
         </cfif>
     </cfloop>

     <cfreturn "Unknown"/>
 </cffunction>

 <cffunction name="convertToText" returntype="String" output="false">
     <cfargument name="HexList" type="String" />
     <cfset var Result = "" />
     <cfset var CurItem = 0 />

     <cfloop index="CurItem" list="#Arguments.HexList#">
         <cfset Result &= Chr(InputBaseN(CurItem,16)) />
     </cfloop>

     <cfreturn Result />
 </cffunction>

Of course, it is worth noting that none of this will work for headerless formats, including many common text-based ones (CFM, CSS, JS, etc.).
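For the Java side of the question, the same signature-table idea could be sketched as below. This is only an illustration under a few assumptions: the class and method names are invented, the signatures are the ones quoted in this thread, and `readNBytes` needs Java 11 or later. It compares raw bytes rather than characters, so values above 0x7F are not affected by character decoding.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper; the names are not from any library.
final class FileFormatSniffer {
    // Format name -> leading bytes, checked in insertion order.
    private static final Map<String, int[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("WordNew",  new int[] {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1});
        SIGNATURES.put("WordBeta", new int[] {0x0E, 0x11, 0xFC, 0x0D, 0xD0, 0xCF, 0x11, 0xE0});
        SIGNATURES.put("Rtf",      new int[] {0x7B, 0x5C, 0x72, 0x74, 0x66}); // {\rtf
        SIGNATURES.put("Jpeg",     new int[] {0xFF, 0xD8});
    }

    /** Returns the name of the first signature the file starts with, or "Unknown". */
    static String determineFileFormat(Path file) throws IOException {
        byte[] head;
        try (InputStream in = Files.newInputStream(file)) {
            head = in.readNBytes(8); // the longest signature is 8 bytes
        }
        for (Map.Entry<String, int[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "Unknown";
    }

    private static boolean startsWith(byte[] data, int[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            // Mask to compare the byte as an unsigned value.
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A file that reports "Rtf" despite its *.doc extension is the case from the question that should be skipped before it reaches POI.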
