Python, file(1) - Why are the numbers [7,8,9,10,12,13,27] and range(0x20, 0x100) used to tell text files from binary files?

In a solution for determining whether a file is binary or text in Python, the answerer uses:

textchars = bytearray([7,8,9,10,12,13,27]) + bytearray(range(0x20, 0x100)) 

and then uses .translate(None, textchars) to remove (that is, replace with nothing) all such characters from the data read from the file in binary mode.
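For reference, the approach described above can be sketched as a small function. This is a minimal sketch assuming Python 3 and a bytes input; the function name is_binary_string is a hypothetical name chosen for illustration.

```python
# Byte values treated as "text": seven control characters plus 0x20-0xFF.
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))


def is_binary_string(data: bytes) -> bool:
    """Return True if any byte of data falls outside textchars.

    The name is hypothetical; only the translate() call comes from the question.
    """
    # translate(None, delete) drops every byte listed in `delete`;
    # whatever survives is a byte the heuristic does not consider text.
    return bool(data.translate(None, textchars))


print(is_binary_string(b"Hello, world!\n"))    # False
print(is_binary_string(b"\x89PNG\r\n\x1a\n"))  # True: 0x1A is not in textchars
```

An empty result from translate() means every byte was in textchars, so the data is classified as text.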

The answerer also states that this choice of numbers is "based on file(1) behaviour" (for deciding what is text and what is not). What is so significant about these numbers that they distinguish text files from binary files?

+2
python binary unicode ascii hex
Aug 24 '15 at 14:27
1 answer

These are the most common code points for printable text, plus newlines, spaces, carriage returns and the like. ASCII is covered up to 0x7F, and standards such as Latin-1 or Windows codepage 1251 use the remaining 128 byte values for accented characters and so on.

You would expect text to use only those code points. Binary data can use any byte value in the range 0x00-0xFF; a text file, for example, will probably not use \x00 (NUL) or \x1F (Unit Separator in ASCII).
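To illustrate the point about single-byte encodings, here is a small sketch, assuming Python 3. The helper name leftover and the sample strings are made up for this example.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))


def leftover(data: bytes) -> bytes:
    """Return the bytes of data that are NOT in textchars (hypothetical helper)."""
    return data.translate(None, textchars)


# Text encoded as ASCII, Latin-1 or cp1251 stays within textchars.
print(leftover("plain ASCII\n".encode("ascii")))   # b''
print(leftover("caf\u00e9".encode("latin-1")))     # b'' (the accented e is byte 0xE9)
print(leftover("\u041f\u0440\u0438\u0432\u0435\u0442".encode("cp1251")))  # b'' (Cyrillic text)

# NUL and Unit Separator are outside textchars, so they survive the translate call.
print(leftover(b"\x00\x1f"))                       # b'\x00\x1f'
```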

At best, this is a heuristic. Some text files may still use C0 control codes other than the 7 explicitly named, and I'm sure there is binary data that happens not to contain any of the 25 byte values missing from the textchars string.
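Both failure modes are easy to reproduce. The sketch below, assuming Python 3, lists the 25 excluded byte values and shows one false positive and one false negative; the sample byte strings are invented for illustration.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# Byte values the heuristic treats as non-text.
excluded = sorted(set(range(0x100)) - set(textchars))
print(len(excluded))  # 25
print(excluded)       # 0-6, 11, 14-26 and 28-31

# False positive: text containing a vertical tab (0x0B) is flagged as binary.
text_with_vt = b"column one\x0bcolumn two\n"
print(bool(text_with_vt.translate(None, textchars)))  # True

# False negative: binary-looking data made only of "text" byte values passes as text.
binary_like = bytes([0xFF, 0xD8, 0xFF, 0xE0])
print(bool(binary_like.translate(None, textchars)))   # False
```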

The author of the range probably based it on the text_chars table from the file command. It marks bytes as non-text, ASCII, Latin-1, or non-ISO extended ASCII and includes documentation on why these code points are selected:

/*
 * This table reflects a particular philosophy about what constitutes
 * "text," and there is room for disagreement about it.
 *
 * [....]
 *
 * The table below considers a file to be ASCII if all of its characters
 * are either ASCII printing characters (again, according to the X3.4
 * standard, not isascii()) or any of the following controls: bell,
 * backspace, tab, line feed, form feed, carriage return, esc, nextline.
 *
 * I include bell because some programs (particularly shell scripts)
 * use it literally, even though it is rare in normal text.  I exclude
 * vertical tab because it never seems to be used in real text.  I also
 * include, with hesitation, the X3.64/ECMA-43 control nextline (0x85),
 * because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline
 * character to.  It might be more appropriate to include it in the 8859
 * set instead of the ASCII set, but it's got to be included in *something*
 * we recognize or EBCDIC files aren't going to be considered textual.
 *
 * [.....]
 */

Interestingly, that table excludes 0x7F (DEL), whereas the code you found does not.
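The difference is simple to check. The following sketch, assuming Python 3, confirms that textchars contains 0x7F and builds a stricter variant without it; the name strict_textchars is hypothetical and the variant is not a full port of the file command's table.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# range(0x20, 0x100) covers 0x7F, so DEL counts as a text character here.
print(0x7F in textchars)  # True

# Hypothetical stricter variant: the same set with DEL (0x7F) removed.
strict_textchars = bytearray(b for b in textchars if b != 0x7F)
print(0x7F in strict_textchars)  # False

# With the stricter set, data containing DEL is no longer classified as text.
print(bool(b"abc\x7f".translate(None, textchars)))         # False
print(bool(b"abc\x7f".translate(None, strict_textchars)))  # True
```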

+4
Aug 24 '15 at 14:29