Programmatically find out the type of file by viewing its binary content. Possible?

I have a C# component that will receive files of the following types: .doc, .pdf, .xls, .rtf

They will be sent by the calling Siebel legacy application as a stream.

So...

[LegacyApp] → {Binary File Stream} → [Component]

The legacy application is a black box that cannot be modified to tell the component what type of file (doc, pdf, xls) it is sending. The component must read this binary stream and create a file on the file system with the correct extension.

Any ideas?

Thank you for your time.

+6
c# file-type filestream
5 answers

On Linux/Unix-based systems you can use the file command, but I assume you want to do this yourself in code...

If all you have access to is a file byte stream, you will need to process each file type independently.

Most programs and components that do what you are interested in read the first few bytes and classify the file based on them. For example, GIFs begin with one of the following: GIF87a or GIF89a

Many file formats have a fixed signature at the beginning of the file or a fixed header format. This signature is called a magic number, as I described in this post.
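To make the "read the first few bytes" idea concrete, here is a minimal sketch in C#. The class and method names (SignatureSniffer, GuessKind) are made up for illustration, and the table only covers signatures I am confident about; it returns a coarse kind rather than a final extension, because both legacy Office files and zip-based files need a further look.

```csharp
using System;

public static class SignatureSniffer
{
    // Leading-byte signatures; "kind" is a coarse label, not a final extension.
    private static readonly (byte[] Magic, string Kind)[] Signatures =
    {
        (new byte[] { 0x25, 0x50, 0x44, 0x46 }, "pdf"),        // "%PDF"
        (new byte[] { 0x7B, 0x5C, 0x72, 0x74, 0x66 }, "rtf"),  // "{\rtf"
        (new byte[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }, "ole2"), // legacy .doc/.xls container
        (new byte[] { 0x50, 0x4B, 0x03, 0x04 }, "zip"),        // "PK", used by many zip-based formats
        (new byte[] { 0x47, 0x49, 0x46, 0x38 }, "gif"),        // "GIF8" (GIF87a / GIF89a)
    };

    // Returns the kind of the first signature that matches, or "unknown".
    public static string GuessKind(byte[] header)
    {
        foreach (var (magic, kind) in Signatures)
        {
            if (StartsWith(header, magic)) return kind;
        }
        return "unknown";
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Calling SignatureSniffer.GuessKind with the first eight bytes of the incoming stream is enough for this table. A match only says the file claims to be that format; it does not validate the rest of the content.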

A good place to start is www.wotsit.org. It contains file format specifications that are searchable by file type. You can look up the file types you need to handle and see if you can find an identifying feature in each format.

You can also search Google for a library that performs this classification, or look at the source code of the file command.

+7

Yes, it is possible: MS Office files (97-2007 or so) start with D0CF11E0, and then there is a subtype marker at byte offset 512.

Link to them: http://www.garykessler.net/library/file_sigs.html

This seems to be the best list of all the file format sites; it is the main one linked from Wikipedia.

It does not provide complete information about the new Office formats, so this is from my own samples. DOCX files begin with "PK" (technically they are zip files) and contain the string "word/_rels/document.xml.rels", while XLSX files contain "xl/_rels/workbook.xml.rels".
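A rough sketch of both subtype checks in C#, under some assumptions. The byte patterns at offset 512 are the Word and Excel "subheader" values as I recall them from the signature list linked above, so verify them against that list; they are a heuristic and will not match every file. The class and method names are hypothetical, and the zip check assumes System.IO.Compression is available (.NET 4.5 or later).

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class OfficeSubtype
{
    // Legacy OLE2 files: compare the bytes at offset 512 with known subheaders.
    // Assumed values, taken from the linked signature list; heuristic only.
    public static string GuessLegacy(byte[] data)
    {
        byte[] word = { 0xEC, 0xA5, 0xC1, 0x00 };
        byte[] excel = { 0x09, 0x08, 0x10, 0x00, 0x00, 0x06, 0x05, 0x00 };
        if (MatchesAt(data, 512, word)) return ".doc";
        if (MatchesAt(data, 512, excel)) return ".xls";
        return null; // OLE2 container with an unrecognised payload
    }

    // New Office formats: open the zip and look for the part names quoted above.
    public static string GuessOoxml(Stream zipStream)
    {
        using (var zip = new ZipArchive(zipStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            if (zip.GetEntry("word/_rels/document.xml.rels") != null) return ".docx";
            if (zip.GetEntry("xl/_rels/workbook.xml.rels") != null) return ".xlsx";
        }
        return null; // some other zip-based file
    }

    private static bool MatchesAt(byte[] data, int offset, byte[] pattern)
    {
        if (data.Length < offset + pattern.Length) return false;
        for (int i = 0; i < pattern.Length; i++)
        {
            if (data[offset + i] != pattern[i]) return false;
        }
        return true;
    }
}
```

Looking up the zip entries by name is more reliable than scanning the raw bytes for the string, since the entry names are stored uncompressed but could in principle appear inside other data. Both methods return null when unsure, so the caller can fall back to a default extension.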

+2

You might be interested: http://en.wikipedia.org/wiki/Magic_number_(programming)

Most binary formats contain a magic number at the beginning. If you only need to recognize a specific set of formats, it should be easy to check the first few bytes of each incoming file and pick the corresponding file extension.
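One way to wire this into a component that only has a stream, sketched in C#. Everything named here (StreamSaver, ReadHeader, Save, the classify callback, the ".bin" fallback) is made up for illustration; the point is that the header bytes are read once, used to choose the extension, and then written to the file ahead of the rest of the stream, so the input does not need to be seekable.

```csharp
using System;
using System.IO;

public static class StreamSaver
{
    // Stream.Read may return fewer bytes than requested, so loop until the
    // buffer is full or the stream ends. Returns only the bytes actually read.
    public static byte[] ReadHeader(Stream input, int count)
    {
        var buffer = new byte[count];
        int total = 0;
        while (total < count)
        {
            int n = input.Read(buffer, total, count - total);
            if (n == 0) break; // end of stream
            total += n;
        }
        Array.Resize(ref buffer, total);
        return buffer;
    }

    // classify is a caller-supplied function mapping header bytes to an
    // extension such as ".pdf", or null when the type is not recognised.
    public static string Save(Stream input, string pathWithoutExtension,
                              Func<byte[], string> classify, int headerSize = 8)
    {
        byte[] header = ReadHeader(input, headerSize);
        string path = pathWithoutExtension + (classify(header) ?? ".bin");
        using (var output = File.Create(path))
        {
            output.Write(header, 0, header.Length); // bytes already consumed
            input.CopyTo(output);                   // remainder of the stream
        }
        return path;
    }
}
```

The default of 8 header bytes covers signatures at the very start of the file; a check that looks deeper into the file would need a larger headerSize or a fully buffered copy.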

+1

On Linux there is a command called file. For an arbitrary file, it tries to determine what kind of file it is. For example:

 gzip compressed data, from Unix, last modified: Fri Jun 12 20:16:28 2009
 HTML document text
 vCalendar calendar file
 RCS/CVS diff output text

This is from a few random files lying around my home directory.

0

Yes. See file.

And please do not reinvent the wheel. It works well as it is.

0