Linux: check whether a file is text or binary

How can I check whether a file is binary or text without opening it?

+6
linux
5 answers
Schrödinger's cat, I'm afraid.

It is not possible to determine the contents of a file without opening it. The file system does not contain metadata related to the content.
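To illustrate what the file system does record, here is a minimal sketch that prints a file's metadata without reading its contents. It assumes GNU coreutils stat(1) (BSD and macOS stat take -f with different format codes), and show_metadata is a hypothetical helper name. Nothing in the output says whether the bytes inside are text or binary.

```shell
# Hypothetical helper: print what the file system knows about a path.
# Assumes GNU coreutils stat(1); it queries the inode and does not read the data.
show_metadata() {
    # %F = file type, %s = size in bytes, %a = permission bits in octal
    stat -c '%F, %s bytes, mode %a' "$1"
}
```

Calling show_metadata file.jpg reports the type (for example a regular file), the size, and the mode, which is the same for a text file and a JPEG of equal size.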

If not opening the file is not a strict requirement, there are a number of solutions available to you.

Edit:

A number of comments and answers suggested that file(1) is a good way to determine the contents. Indeed it is. However, file(1) opens the file, which the question forbade. See the penultimate line in the following example:

 > echo 'This is not a pipe' > file.jpg && strace file file.jpg 2>&1 | grep file.jpg
 execve("/usr/bin/file", ["file", "file.jpg"], [/* 56 vars */]) = 0
 lstat64("file.jpg", {st_mode=S_IFREG|0644, st_size=19, ...}) = 0
 stat64("file.jpg", {st_mode=S_IFREG|0644, st_size=19, ...}) = 0
 open("file.jpg", O_RDONLY|O_LARGEFILE) = 3
 write(1, "file.jpg: ASCII text\n", 21file.jpg: ASCII text

+9

The correct way to determine the type of a file is to use the file(1) command.

You also need to know that UTF-8 encoded files are "text" files but may contain non-ASCII bytes. Other encodings have the same problem. For text encoded using a code page, it may not be possible to determine unambiguously whether the file is text or not.

file(1) examines the structure of the file to try to determine what it contains. From the file(1) man page:

The type printed will usually contain one of the words text (the file contains only printing characters and a few common control characters and is probably safe to read on an ASCII terminal), executable (the file contains the result of compiling a program in a form understandable to some UNIX kernel or another), or data, meaning anything else (data is usually "binary" or non-printable).

As for the various character encodings, the file(1) man page says:

If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as "text" because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only "character data" because, while they contain text, it is text that will require translation before it can be read.

So some text will be reported as text, while other text may be reported as character data. You will need to decide for yourself whether the distinction matters to your application and act accordingly.
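As a rough sketch of how this could be scripted, the function below asks file(1) for the detected character set only and maps it to "text" or "binary". It assumes a file(1) build that supports the -b and --mime-encoding options, and is_text_file is a hypothetical name. Note that it treats every reported encoding other than "binary" as text, including UTF-16, which may or may not suit your application.

```shell
# Hypothetical helper: print "text" or "binary" for the given path.
# Assumes file(1) supports -b (omit the file name) and --mime-encoding.
is_text_file() {
    # Refuse unreadable or missing paths instead of classifying an error message.
    [ -f "$1" ] && [ -r "$1" ] || return 2
    enc=$(file -b --mime-encoding "$1") || return 2
    case "$enc" in
        binary) echo binary ;;
        *)      echo text ;;   # us-ascii, utf-8, iso-8859-1, utf-16le, ...
    esac
}
```

For example, is_text_file notes.txt would be expected to print text for a plain ASCII file. Like file(1) itself, this is a guess based on the bytes it reads, not a guarantee.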

+6
source share

There is no way to be sure without looking at the file. However, you do not need to open it in an editor to get a clue. You can look at the file command: http://linux.die.net/man/1/file

+2

If you are trying to do this from the shell, the file command will guess what type of file it is. If it is text, the description will usually include the word "text".

I am not aware of a 100% reliable method for determining this, but the file command is probably the most accurate.

+2

On Unix, a file is just a sequence of bytes. So without opening the file, you cannot be 100% sure whether it is ASCII or binary.

You can only use the tools available to you and dig deeper to confirm it.

  • file
  • cat -v
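A small sketch of the cat -v approach: the hypothetical helper peek prints the first bytes of a file with non-printing characters made visible (control characters appear in ^X notation, bytes with the high bit set get an M- prefix). It assumes a head that accepts -c, which is common but not universal, and the 256-byte default is an arbitrary choice.

```shell
# Hypothetical helper: show the first N bytes (default 256) of a file,
# with non-printing bytes rendered visibly by cat -v.
peek() {
    head -c "${2:-256}" "$1" | cat -v
}
```

If the output of peek somefile is full of ^@ and M- sequences, the file is very likely binary; if it reads cleanly, it is probably text.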
+2