Detecting if a file is binary or plain text?

How can I determine whether a file is binary or plain text?

My .NET application mostly processes batch files and extracts data from them; however, I do not want it to process binary files.

As a solution, I am going to parse the first X bytes of the file and, if there are more non-printable characters than printable ones, treat the file as binary.

Is this a reasonable approach? Is there a better way to implement this?

+4
4 answers

What exactly do you mean by binary? Is The Art of War, written in Chinese, binary? How about a Japanese-English dictionary?

There is no 100% reliable way.

You will need to use some kind of heuristic.

Some options: check the file extension, or check the file signature (the identifying bytes at the start of the file).

If the above (especially signatures and file extensions) do not help, try to guess based on the presence or absence of certain bytes (as you are already doing).

Note: it is better to check the extensions and signatures first, since you only need to read a few bytes or the file's metadata, which is cheap compared to reading the entire file.
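To make the "cheap checks first" idea concrete, here is a minimal sketch. It is written in Python for brevity rather than .NET, and the function names, the extension set, and the short signature table are all illustrative assumptions, not a complete catalogue. A plain text file that happens to start with one of these byte sequences would be misclassified.

```python
import os

# Hypothetical, deliberately short tables; extend them for the formats you expect.
TEXT_EXTENSIONS = {".bat", ".cmd", ".txt"}
BINARY_SIGNATURES = {
    b"MZ": "Windows executable",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"PK\x03\x04": "ZIP archive",
    b"%PDF-": "PDF document",
}


def signature_of(head: bytes):
    """Return a label if head starts with a known signature, else None."""
    for magic, label in BINARY_SIGNATURES.items():
        if head.startswith(magic):
            return label
    return None


def quick_check(path: str) -> str:
    """Return 'binary', 'text' or 'unknown' from the first bytes and the name."""
    with open(path, "rb") as f:
        head = f.read(8)  # the longest signature above is 8 bytes
    if signature_of(head) is not None:
        return "binary"
    ext = os.path.splitext(path)[1].lower()
    if ext in TEXT_EXTENSIONS:
        return "text"
    # Neither check was conclusive; fall back to a byte-level heuristic.
    return "unknown"
```

The signature is checked before the extension because an extension is only a name and can be wrong, while both checks stay far cheaper than scanning the file's contents.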

+6

The Unix file command does this in a smart way. Of course, it does a lot more than that, but you can check the algorithm here and then build something specialized.


UPDATE: It looks like the link above is dead. Try this.

+4

You can match the first X bytes against a pattern and report a positive result if all of them fall in the appropriate character class. However, this assumes that you know the encoding.
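A minimal sketch of the character-class idea, in Python rather than .NET and assuming the encoding is plain ASCII. The function name, the default of 512 bytes, and the choice to accept tab alongside CR and LF are assumptions made for illustration.

```python
import re

# Assumption: ASCII text, i.e. tab, LF, CR and the printable range 0x20-0x7E.
ASCII_TEXT = re.compile(rb"[\t\n\r\x20-\x7e]*")


def head_matches_ascii(data: bytes, x: int = 512) -> bool:
    """True if every one of the first x bytes is in the ASCII text class."""
    return ASCII_TEXT.fullmatch(data[:x]) is not None
```

For another encoding the class would have to change, which is exactly the limitation the answer points out: UTF-8 text containing non-ASCII characters fails this particular check.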

0

I think the best way is to read at most the first X bytes of the file (X might be 256, 512, etc.) and count the bytes that plain ASCII text files do not use (allowed ASCII codes: 10, 13, 32-126). If you know for sure that the script is written in English, no byte should fall outside this set. If you are not sure about the language, you can allow at most Y bytes outside the set (if X is 512, I would choose 8 or 10 for Y).
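The counting rule above could be sketched as follows, in Python rather than .NET. The function names and the defaults (X = 512, Y = 8) are illustrative assumptions taken from the numbers in this answer. Note that the allowed set is exactly the one listed here, so a tab (code 9) counts against the budget Y.

```python
# Allowed codes as given in the answer: LF (10), CR (13) and 32-126.
ALLOWED = frozenset({10, 13}) | frozenset(range(32, 127))


def count_outside(data: bytes, x: int = 512) -> int:
    """Count bytes among the first x that fall outside the allowed set."""
    return sum(1 for b in data[:x] if b not in ALLOWED)


def looks_like_text(data: bytes, x: int = 512, y: int = 8) -> bool:
    """Treat the data as text if at most y of the first x bytes are outside."""
    return count_outside(data, x) <= y
```

Passing y=0 gives the strict variant for files known to be pure English ASCII. If your batch files are tab-indented, adding 9 to the allowed set is probably the first adjustment to make.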

If this is not enough, you can add more restrictions based on the syntax of the files, for example that certain keywords must be present (for your batch files, there should be some echo, for, if, goto, call, exit, etc.).
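A minimal sketch of the keyword test, again in Python rather than .NET. The function name is hypothetical and the keyword tuple contains only the examples named in this answer; it assumes the bytes have already been decoded to a string.

```python
import re

# Only the keywords named in the answer; a real list would likely be longer.
BATCH_KEYWORDS = ("echo", "for", "if", "goto", "call", "exit")
_WORD = re.compile(r"[A-Za-z]+")


def has_batch_keyword(text: str) -> bool:
    """True if any batch keyword appears as a whole word, ignoring case."""
    words = {w.lower() for w in _WORD.findall(text)}
    return any(k in words for k in BATCH_KEYWORDS)
```

Words such as "if" and "for" also occur in ordinary English prose, so this is best used as an extra filter after the byte-level check rather than on its own.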

0

Source: https://habr.com/ru/post/1311115/

