Clear understanding of files, file encoding, file format

I lack a clear understanding of the concepts of files, file encoding and file format. Google has helped to some extent. From what I understand so far, all files are binary, i.e. each byte in a file can hold any of 256 possible bit patterns. ASCII files (and this is where the encoding part comes in) are a subset of binary files in which each byte uses only 7 bits.

And this is where things get confusing. A file format seems to be a way of interpreting the bytes in a file, and file extensions are apparently one of the most common ways of indicating the file format.

Does this mean that there are formats specific to binary files and formats specific to ASCII files? Do formats such as xml, pdf, doc, rtf, html, xls, sql, tex, java, cs "refer" to ASCII files, while formats like jpg, mp3, avi, eps, obj, out, dll hint that we are talking about binary files?

+6
4 answers

I don't think you should talk about ASCII and BINARY files, but rather about TEXT and BINARY files.

In this sense, these are text files: XML, HTML, RTF, SQL, TEXT, JAVA, CSS, EPS.

And these are binary files: PDF, DOC, XLS, JPG, MP3, AVI, OBJ, DLL.

ASCII is just a character table used in the early days of computing to represent text, but nowadays it is somewhat discouraged because it cannot represent text in languages such as Chinese, Arabic, Spanish (words with ñ, Ñ, accent marks), French and others. Instead of ASCII, other CHARACTER ENCODINGS are now preferred. The best known is probably UTF-8. But there are others, like ISO-8859-1, ISO-8859-3, etc. Take a look at this article by Joel Spolsky on UNICODE. It is very useful.
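To make the encoding point concrete, here is a small Python sketch (my own illustration; the sample word is just an assumption) showing that the same text becomes different bytes depending on the encoding, and that ASCII cannot encode ñ at all:

```python
# The same text produces different bytes under different encodings.
text = "a\u00f1o"  # the Spanish word "año", written with an escape for ñ

utf8_bytes = text.encode("utf-8")         # ñ takes two bytes in UTF-8
latin1_bytes = text.encode("iso-8859-1")  # ñ takes one byte in ISO-8859-1

print(utf8_bytes)    # b'a\xc3\xb1o'
print(latin1_bytes)  # b'a\xf1o'

# ASCII has no code for ñ, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding bytes with the wrong table does not fail here; it just gives
# garbled text, which is how mojibake ends up in documents.
garbled = utf8_bytes.decode("iso-8859-1")
print(ascii_ok, garbled)
```

The bytes in the file are the same in every case; only the table used to read them changes what you see.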

File formats are another very important issue. File formats are conventions that programs agree on for representing information. In this sense, a JPG file is an image with a specific (well-known) internal format that allows programs (browsers, spreadsheets, word processors) to use it as an image.
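As a rough illustration of "a well-known internal format": many binary formats start with a fixed signature, so a program can recognise them without trusting the file extension. The helper name `guess_format` and the short signature table are my own choices for this sketch, not a complete detector:

```python
# Hypothetical helper: guess a format from the first bytes of the data.
# The table only lists a few well-known signatures.
MAGIC_NUMBERS = [
    (b"\xff\xd8\xff", "jpg"),        # JPEG files start with FF D8 FF
    (b"%PDF-", "pdf"),               # PDF files start with "%PDF-"
    (b"\x89PNG\r\n\x1a\n", "png"),   # PNG has an 8-byte signature
]


def guess_format(data: bytes) -> str:
    """Return a format name if the data starts with a known signature."""
    for magic, name in MAGIC_NUMBERS:
        if data.startswith(magic):
            return name
    return "unknown"


print(guess_format(b"\xff\xd8\xff\xe0" + b"...rest of image..."))
print(guess_format(b"%PDF-1.7\n"))
print(guess_format(b"just some text"))
```

To use it on a real file you would open it in binary mode and pass the first few bytes, for example `open(path, "rb").read(16)`.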

Text files also have formats (for example, there are specifications for text formats such as XML and HTML). Their format, as with JPG and other binary files, allows applications to use them in a coherent and concrete way to achieve something: for example, rendering a WEB PAGE (HTML and XHTML formats).

+7

The actual way the file is stored on the hard drive is determined by the OS. The actual contents of the file can be described as an array of bytes, each of which can hold one of 256 possible values.

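Text files use either a small character set that fits in one byte per character (ASCII has 128 characters; its 8-bit extensions have 256), in which case almost any viewer can read them easily, or a wider character set, in which case only applications that understand the encoding can read them.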

The rest is what you can call binary (along with any other formats that are "unreadable" for "text" viewers). These are formats intended to be read by some other application or by the OS. If it is an executable file, the OS can read and execute it; others, like jpg, are designed to be "understood" by a photo viewer, and so on.
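A very rough way to see this split in code: treat data as "text" if it contains no NUL bytes and decodes cleanly in the encoding you expect. This is only a heuristic I am assuming for illustration (for instance, UTF-16 text contains NUL bytes and would be misclassified), and `looks_like_text` is a made-up name:

```python
def looks_like_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Heuristic: no NUL bytes, and the bytes decode in the given encoding."""
    if b"\x00" in data:
        return False
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_text(b"<html><body>hello</body></html>"))  # plain ASCII text
print(looks_like_text(b"\xff\xd8\xff\xe0\x00\x10JFIF"))      # start of a JPEG
print(looks_like_text("a\u00f1o".encode("utf-8"), "ascii"))  # not pure ASCII
```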

+2

This is an old question, but still very relevant. I was also confused, and asked for clarification. Here's a summary (hope this helps someone):

Format. A file/record format is a way of representing data. You can use CSV, TSV, JSON, Apache log format, Thrift format, Protobuf format, etc. to represent your data. The format is responsible for the well-formedness and validity of the data representation. Example: when you read a JSON file, you are guaranteed to get key-value pairs, possibly nested; that guarantee always holds.

{ "story": { "title": "beauty and the beast" } } 
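A quick Python sketch of what that guarantee looks like in practice, using the standard `json` module: well-formed input gives you nested keys you can rely on, and malformed input is rejected instead of being silently accepted. The truncated string is my own example of broken input:

```python
import json

raw = '{ "story": { "title": "beauty and the beast" } }'

# Well-formed JSON parses into nested dictionaries.
data = json.loads(raw)
title = data["story"]["title"]
print(title)

# Input that violates the format is rejected by the parser.
try:
    json.loads('{ "story": { "title": ')
    parsed = True
except json.JSONDecodeError:
    parsed = False
print(parsed)
```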

Encoding. Encoding basically converts your data (in any format, or plain text) into a specific scheme. So what is this scheme? The scheme depends on the purpose of the encoding. For example, when transmitting data over a wire (the Internet), we want to make sure that the JSON example above arrives at the other side correctly and is not corrupted. To ensure this, we would add some meta-information, such as a checksum, which can be used to verify the correctness of the data. Other uses of encoding include data compression, keeping data secret, etc.

 Base64 encoding of above JSON example: ew0KICAgICAgICAic3RvcnkiOiB7DQogICAgICAgICAgICAidGl0bGUiOiAiYmVhdXR5IGFuZCB0aGUgYmVhc3QiDQogICAgICAgIH0NCn0= 
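Here is a small sketch of both ideas in Python: decoding the Base64 string above back to the JSON text, and using a checksum as the extra meta-information. Note that Base64 on its own does not detect corruption; the SHA-256 digest and the helper `checksum_matches` are my own assumptions for the example, not something a particular protocol prescribes:

```python
import base64
import hashlib
import json

encoded = (
    "ew0KICAgICAgICAic3RvcnkiOiB7DQogICAgICAgICAgICAidGl0bGUiOiAiYmVhdXR5"
    "IGFuZCB0aGUgYmVhc3QiDQogICAgICAgIH0NCn0="
)

# Base64 is reversible: decoding gives back the original bytes.
decoded = base64.b64decode(encoded)
story = json.loads(decoded.decode("utf-8"))
print(story)

# The sender computes a checksum and sends it along with the data.
digest = hashlib.sha256(decoded).hexdigest()


def checksum_matches(payload: bytes, expected_digest: str) -> bool:
    """Receiver side: recompute the digest and compare with the sent one."""
    return hashlib.sha256(payload).hexdigest() == expected_digest


print(checksum_matches(decoded, digest))         # untouched payload
print(checksum_matches(decoded + b"x", digest))  # payload altered in transit
```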
0

I think it's worth noting that with media files, MPEG and others are kinds of media codecs. They define how digital data can represent video and audio. Typically, the encoded data is placed in a multimedia container file, such as an AVI file, which is really a type of RIFF file designed for media.
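To show what "a type of RIFF file" means at the byte level, here is a minimal sketch that reads the 12-byte RIFF header: the ASCII tag "RIFF", a 4-byte little-endian size, and a 4-byte form type ("AVI " for AVI, "WAVE" for WAV). The function name and the tiny hand-built sample are assumptions for illustration; it does not parse the chunks inside the container:

```python
import struct


def read_riff_header(data: bytes):
    """Return (form_type, declared_size) for RIFF data, or None otherwise."""
    if len(data) < 12 or data[:4] != b"RIFF":
        return None
    # The size field is an unsigned 32-bit little-endian integer.
    (size,) = struct.unpack("<I", data[4:8])
    form_type = data[8:12].decode("ascii", errors="replace")
    return form_type, size


# A hand-built header only; a real AVI file would continue with chunks.
sample = b"RIFF" + struct.pack("<I", 4) + b"AVI "
print(read_riff_header(sample))
print(read_riff_header(b"\xff\xd8\xff\xe0"))  # a JPEG start, not RIFF
```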

0