How to detect the MIME type of text files: CSS, JavaScript, INI, SQL?

Finding the MIME type of a file using PHP is trivial - just use the PEAR MIME_Type package, PHP fileinfo, or call file -i on a Unix machine. This works very well for binary files and all others that have some kind of "magic bytes" through which they are easily detected.

What I have not managed to do is detect the correct MIME type of text files such as:

  • CSS
  • Diff
  • INI (configuration)
  • Javascript
  • Rst
  • SQL

They are all identified as text/plain, which is correct but too unspecific for me. I need the real type, even if it takes some time to parse the contents of the file.

So my question is: what are the solutions for detecting the MIME type of such text files? Any libraries? Code snippets?


Please note that I do not have a file name or file extension, but I have the contents of the file.
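
For illustration, detection from the contents alone (no file name) can be done with the fileinfo extension's buffer() method. This is only a minimal sketch: the function name detectFromContents and the application/octet-stream fallback are assumptions made for this example.

```php
<?php
// Sketch: detect a MIME type from a string instead of a file path.
// Requires the fileinfo extension.
function detectFromContents(string $contents): string
{
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $type = $finfo->buffer($contents);

    // buffer() returns false on failure; use a generic type in that case
    // (the fallback value is an assumption of this sketch).
    return $type === false ? 'application/octet-stream' : $type;
}

$css = "body { font-family: sans-serif; }\n";
echo detectFromContents($css), "\n";
```

The result depends on the magic database installed on the system. For the text formats listed above, it is usually the generic type that this question is trying to get past.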


If I were using Ruby, I could integrate GitHub's linguist. Ohloh's ohcount is written in C, but has a command line tool to detect the type: ohcount -d $file

What I tried

ohcount

Detects XML and PHP files correctly, but none of the others.

Apache tika

Detects XML and HTML; all other test files are reported only as text/plain.

+7
4 answers

Since I did not find a suitable library, I wrote my own magic file, which correctly recognizes all my test files.

My application first tries my custom magic file for detection and falls back to the normal system magic file if no type is found.

The code is on GitHub, see https://github.com/cweiske/MIME_Type_PlainDetect . The magic file is located in data/programming.magic and can be used with file -f programming.magic /path/to/source
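
The "custom magic file first, system magic file second" order described above could be sketched roughly as follows. This is not the code from the linked repository: detectWithFallback and magicDetector are hypothetical names, and treating a text/plain result as "no type found" is an assumption of this sketch.

```php
<?php
// Try each detector in order; the first specific answer wins.
// Each detector is a callable taking the contents and returning a
// MIME type string, or null/false when it has no answer.
function detectWithFallback(string $contents, array $detectors): string
{
    foreach ($detectors as $detect) {
        $type = $detect($contents);
        // Assumption: generic text/plain counts as "no type found".
        if (is_string($type) && $type !== '' && $type !== 'text/plain') {
            return $type;
        }
    }
    return 'text/plain';
}

// Hypothetical helper: builds a detector around fileinfo. With no
// argument it uses the system magic database, otherwise the given file.
function magicDetector(?string $magicFile = null): callable
{
    return function (string $contents) use ($magicFile) {
        $finfo = $magicFile === null
            ? new finfo(FILEINFO_MIME_TYPE)
            : new finfo(FILEINFO_MIME_TYPE, $magicFile);
        return $finfo->buffer($contents);
    };
}
```

It would then be called with the custom detector first, for example detectWithFallback($contents, [magicDetector('data/programming.magic'), magicDetector()]), where the path must point to wherever the magic file actually lives.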

+2

I think Apache Tika's magic-based detection can help you:

http://tika.apache.org/

+2

How to do it:

  • .ini: To check INI files, use parse_ini_file. It returns false if the file is not valid INI.
  • .css: First check whether you find something like body {, html { or body, html {. You can also look for CSS keywords such as font-family, background, border, etc.
  • .sql: You will most likely find something like INSERT INTO, UPDATE (.*) SET, CREATE TABLE, etc. Again, look for the keywords.
  • .js: For JavaScript, you will need to search the whole content for its keywords ...

I do not know how to detect the others.
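
The checks listed above could be combined into one function along these lines. It is a rough sketch, not a reliable detector: guessTextType and looksLikeIni are hypothetical names, the patterns and their order are illustrative, and text/x-ini is a made-up label, since INI has no registered type. Because only the contents are available, it uses parse_ini_string rather than parse_ini_file.

```php
<?php
// Returns true when the text has INI-looking lines and also parses as INI.
function looksLikeIni(string $contents): bool
{
    // Require at least one "[section]" or "key = value" line, so that
    // arbitrary prose is not reported as INI just because it parses.
    if (!preg_match('/^\s*(\[[^\]\r\n]+\]|[\w.-]+\s*=)/m', $contents)) {
        return false;
    }
    // parse_ini_string() returns false on a syntax error.
    return @parse_ini_string($contents, true, INI_SCANNER_RAW) !== false;
}

// Keyword heuristics, tried in a fixed order; first match wins.
function guessTextType(string $contents): string
{
    // SQL: a statement keyword at the start of a line.
    if (preg_match('/^\s*(INSERT\s+INTO|UPDATE\s+\S+\s+SET|CREATE\s+TABLE)\b/mi', $contents)) {
        return 'application/sql';
    }
    // JavaScript: function definitions, variable declarations, arrows.
    // Checked before CSS because object literals resemble CSS blocks.
    if (preg_match('/\bfunction\s*[\w$]*\s*\(|\b(?:var|let|const)\s+[\w$]+\s*=|=>/', $contents)) {
        return 'text/javascript';
    }
    // CSS: a selector followed by a block starting with "property:".
    if (preg_match('/[^{}]+\{\s*[a-z-]+\s*:[^{}]*\}/i', $contents)) {
        return 'text/css';
    }
    if (looksLikeIni($contents)) {
        return 'text/x-ini'; // hypothetical label, not a registered type
    }
    return 'text/plain';
}
```

Heuristics like these misfire on mixed or unusual input (for example, SQL quoted inside a JavaScript string), so the order of the checks matters and would need tuning against real test files.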

+2

I found this library: http://pear.php.net/package/MIME_Type/

As described, it "Provides functionality for working with MIME types." and gives the following features:

  • Parsing a MIME type.
  • Supports full RFC2045 specification.
  • Many utility functions for working with and defining type information.
  • Most functions can be called statically.
  • Auto-detect a file's MIME type using the fileinfo extension, the mime_magic extension, the 'file' command, or a built-in list of mappings
0
