How to calculate file entropy?

How do I calculate the entropy of a file? (Or let's just say, of a bunch of bytes.)
I have an idea, but I'm not sure if it is mathematically correct.

My idea is this:

  • Create an array of 256 integers (all zeros).
  • Go through the file and for each of its bytes,
    increase the corresponding position in the array.
  • At the end: Calculate the "average" value for the array.
  • Initialize a counter with zero,
    and for each of the array entries:
    add that entry's difference from the "average" to the counter.

OK, now I'm stuck. How do I "design" the result counter so that all results lie between 0.0 and 1.0? But I'm pretty sure the idea is flawed anyway ...

I hope someone has a better, simpler solution?

Note: I need all of this in order to make guesses about the file's contents:
(plain text, markup, compressed, or some binary format, ...)

+60
algorithm file-io entropy
Jun 13 '09 at 10:23
11 answers
  • At the end: Calculate the "average" value for the array.
  • Initialize a counter with zero, and for each of the array entries: add that entry's difference from the "average" to the counter.

With some changes, you can get the Shannon entropy:

rename "medium" to "entropy"

 (float) entropy = 0
 for i in the array[256]: Counts do
     (float) p = Counts[i] / filesize
     if (p > 0) entropy = entropy - p * lg(p)   // lg() is the logarithm with base 2

Edit: As Wesley pointed out, we must divide the entropy by 8 to bring it into the range 0..1 (or, alternatively, we can use logarithm base 256).
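
Here is a minimal Python sketch of the approach described above, assuming counts is the 256-entry byte histogram and filesize is the total number of bytes (the names follow the pseudocode; the function name is mine):

 import math

 def entropy_0_to_1(counts, filesize):
     # counts: 256-entry histogram of byte values; filesize: total number of bytes
     entropy = 0.0
     for c in counts:
         if c == 0:
             continue                      # empty bins contribute nothing
         p = c / filesize
         entropy -= p * math.log2(p)       # Shannon entropy in bits per byte (0..8)
     return entropy / 8.0                  # normalize to 0..1 (same as using log base 256)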

+44
Jun 13 '09 at 11:05

To calculate the information entropy of a collection of bytes, you'll need to do something similar to tydok's answer. (tydok's answer works on a collection of bits.)

It is assumed that the following variables already exist:

  • byte_counts is a 256-element list of the number of bytes with each value in your file. For example, byte_counts[2] is the number of bytes with a value of 2 .

  • total is the total number of bytes in your file.

I will write the following code in Python, but it should be obvious what is happening.

 import math

 entropy = 0
 for count in byte_counts:
     # If no bytes of this value were seen in the file, it doesn't affect
     # the entropy of the file.
     if count == 0:
         continue
     # p is the probability of seeing this byte in the file, as a
     # floating-point number
     p = 1.0 * count / total
     entropy -= p * math.log(p, 256)

It is important to note a few things.

  • The check for count == 0 is not just an optimization. If count == 0, then p == 0, and log(p) would be undefined ("negative infinity"), causing an error.

  • The 256 in the call to math.log represents the number of discrete values that are possible. An eight-bit byte has 256 possible values.

The resulting value will be between 0 (every byte in the file is the same) and 1 (the bytes are evenly divided among all possible byte values).




An explanation of the use of log base 256

It is true that this algorithm is usually applied using log base 2. This gives the resulting answer in bits. In that case, you have a maximum of 8 bits of entropy for any given file. Try it yourself: maximize the entropy of the input by making byte_counts a list of all 1s, or 2s, or 100s. When the bytes of a file are evenly distributed, you'll find there is an entropy of 8 bits.
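
For instance, a quick sanity check of that claim (a minimal sketch; the uniform byte_counts here is just made-up data):

 import math

 byte_counts = [100] * 256          # every byte value equally frequent
 total = sum(byte_counts)

 bits = -sum(c / total * math.log2(c / total) for c in byte_counts if c)
 print(bits)        # 8.0 bits per byte: the maximum entropy for 8-bit symbols
 print(bits / 8)    # 1.0 when normalized (equivalent to using log base 256)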

You can use other logarithm bases. Using b = 2 gives a result in bits, since each bit can have 2 values. Using b = 10 puts the result in dits, or decimal bits, since there are 10 possible values for each dit. Using b = 256 gives the result in bytes, since each byte can have one of 256 discrete values.

Interestingly, using log identities, you can work out how to convert the resulting entropy between units. Any result obtained in units of bits can be converted to units of bytes by dividing by 8. As an interesting, intentional side effect, this gives the entropy as a value between 0 and 1.

In short:

  • You can use different units to express entropy
  • Most people express entropy in bits (b = 2)
    • For a set of bytes this gives a maximum entropy of 8 bits
    • Since the asker wants a result between 0 and 1, divide this result by 8 for a meaningful value
  • The algorithm above calculates the entropy in bytes (b = 256)
    • This is equivalent to (entropy in bits) / 8
    • This already gives a value from 0 to 1
+29
Jun 13 '09 at 13:14

A simpler solution: gzip the file. Use the ratio of file sizes, (gzipped size) / (original size), as a measure of randomness (i.e. entropy).

This method does not give you the exact absolute value of entropy (because gzip is not an "ideal" compressor), but it is good enough if you just need to compare the entropy of different sources.
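
A minimal sketch of that idea in Python (the clamp to 1.0 is my addition, since gzip headers can make very small files grow slightly):

 import gzip

 def compression_ratio(path):
     with open(path, "rb") as f:
         data = f.read()
     if not data:
         return 0.0
     compressed = gzip.compress(data)
     # A ratio near 1.0 means nearly incompressible (high entropy);
     # a ratio near 0.0 means highly redundant (low entropy).
     return min(1.0, len(compressed) / len(data))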

+25
Jun 13 '09 at 10:38

For what it's worth, here is the traditional calculation (entropy in bits), presented in C#:

 /// <summary>
 /// returns bits of entropy represented in a given string, per
 /// http://en.wikipedia.org/wiki/Entropy_(information_theory)
 /// </summary>
 public static double ShannonEntropy(string s)
 {
     var map = new Dictionary<char, int>();
     foreach (char c in s)
     {
         if (!map.ContainsKey(c))
             map.Add(c, 1);
         else
             map[c] += 1;
     }

     double result = 0.0;
     int len = s.Length;
     foreach (var item in map)
     {
         var frequency = (double)item.Value / len;
         result -= frequency * (Math.Log(frequency) / Math.Log(2));
     }

     return result;
 }
+16
Feb 21 '11 at 10:01

Is this something ent can handle? (Or perhaps it is not available on your platform.)

 $ dd if=/dev/urandom of=file bs=1024 count=10
 $ ent file
 Entropy = 7.983185 bits per byte.
 ...

As a counter-example, here is a file with no entropy.

 $ dd if=/dev/zero of=file bs=1024 count=10
 $ ent file
 Entropy = 0.000000 bits per byte.
 ...
+12
Jun 13 '09 at 20:58

There is no such thing as the entropy of a file. In information theory, entropy is a function of a random variable, not of a fixed data set (well, technically a fixed data set does have an entropy, but that entropy would be 0 — we can regard the data as a random distribution that has only one possible outcome, with probability 1).

To calculate entropy, you need a random variable with which to model your file. The entropy will then be the entropy of the distribution of that random variable. This entropy equals the number of bits of information contained in that random variable.

+11
Jun 13 '09 at 20:47

I'm two years late in answering, so please consider this despite only a few up-votes.

Short answer: use my 1st and 3rd bold equations below to get what most people are thinking of when they say "entropy" of a file, in bits. Use just the 1st equation if you want Shannon's H entropy, which is actually an entropy/symbol, as he stated 13 times in his paper, something most people are not aware of. Some online entropy calculators use this one, but Shannon's H is "specific entropy", not "total entropy", which has caused a lot of confusion. Use the 1st and 2nd equations if you want the answer between 0 and 1, which is normalized entropy/symbol (it is not bits/symbol, but a true statistical measure of the "entropic nature" of the data, letting the data choose its own log base instead of arbitrarily using 2, e, or 10).

There are 4 kinds of entropy for a file (data set) of N symbols in length with n unique symbol types; a short code sketch computing all four follows the list below. But keep in mind that by knowing the contents of a file, you know the state it is in and therefore S = 0. To be precise, if you have a source that generates a lot of data that you have access to, then you can calculate the expected future entropy/character of that source. If you use the following on a single file, it is more accurate to say it estimates the expected entropy of other files from that source.

  • Shannon (specific) entropy H = -1 * sum(count_i / N * log(count_i / N))
    where count_i is the number of times symbol i occurred in N.
    Units are bits/symbol if log is base 2, nats/symbol if natural log.
  • Normalized specific entropy: H / log(n)
    Units are entropy/symbol. Ranges from 0 to 1. 1 means every symbol occurs equally often;
    near 0 means all symbols except 1 occurred only once, and the rest of a very long file was
    the other symbol. The log should be in the same base as the one used for H.
  • Absolute entropy S = N * H
    Units are bits if log is base 2, nats if natural log.
  • Normalized absolute entropy S = N * H / log(n)
    Units are "entropy"; ranges from 0 to N. The log should be in the same base as the one used for H.

Although the last one is the truest "entropy", the first one (Shannon's entropy H) is what all the books call "entropy" without (the needed, IMHO) qualification. Most do not clarify (as Shannon did) that it is bits/symbol, i.e. entropy per symbol. Calling H "entropy" is speaking too loosely.

For a file with equal frequency of each symbol: S = N * H = N. This is the case for most large files of bits. Entropy does not do any data compression and is thus completely ignorant of any patterns, so 000000111111 has the same H and S as 010111101000 (six 1s and six 0s in both cases).

As others have said, using a standard compression routine such as gzip and dividing the size before by the size after will give a better measure of the amount of pre-existing "order" in the file, but it is biased toward data that fits the compression scheme better. There is no completely optimized general-purpose compressor that we could use to define an absolute "order".

Another thing to consider: H changes if you change how you express the data. H will be different if you choose different groupings of bits (bits, nibbles, bytes, or hex). So you divide by log(n), where n is the number of unique symbols in the data (2 for binary, 256 for bytes), and H will range from 0 to 1 (this is the normalized intensive Shannon entropy in units of entropy per symbol). But technically, if only 100 of the 256 possible byte values occur, then n = 100, not 256.

H is the "intense" entropy, i.e. it corresponds to a symbol that is similar to specific entropy in physics, which is entropy per kg or per mole. A regular "extensive" file entropy, similar to physics, S is S = N * H, where N is the number of characters in the file. H would be exactly the same as part of the ideal volume of gas. Information entropy cannot simply be made exactly equal to physical entropy in a deeper sense, because physical entropy allows for “ordered” as well as disordered mechanisms: physical entropy comes out more than completely random entropy (for example, a compressed file). One aspect is different For an ideal gas there is an additional 5/2 factor for this: S = k * N * (H + 5/2), where H = possible quantum states per molecule = (xp) ^ 3 / hbar * 2 * sigma ^ 2 where x = field width, p = total undirected momentum in the system (calculated from kinetic energy and mass per molecule) and sigma = 0.341 in accordance with the uncertainty principle, giving only the number of possible states within the 1st deck.

A little math gives a shorter form of normalized extensive entropy for a file:

S = N * H / log(n) = sum(count_i * log(N / count_i)) / log(n)

The units of this are "entropy" (which is not really a unit). Being normalized, it is a better universal measure than the "entropy" units of N * H. But it also should not be called "entropy" without clarification, because the normal historical convention is to mistakenly call H "entropy" (which contradicts the clarifications made in Shannon's own text).

+8
Jan 21 '16 at 12:43

If you use information-theoretic entropy, keep in mind that it might make sense not to use it on bytes. Say, if your data consists of floats, you should instead fit a probability distribution to those floats and calculate the entropy of that distribution.

Or, if the contents of the file are Unicode characters, you should use those, etc.
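
A minimal sketch of that idea for floats, fitting nothing fancier than a histogram (the bin count of 32 and the function name are arbitrary choices of mine):

 import math
 from collections import Counter

 def float_entropy(values, bins=32):
     # Bin the floats, then compute the Shannon entropy of the bin distribution.
     lo, hi = min(values), max(values)
     width = (hi - lo) / bins or 1.0             # all-equal values fall into a single bin
     counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
     total = len(values)
     return -sum(c / total * math.log2(c / total) for c in counts.values())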

+5
Jun 13 '09 at 11:33

This computes the entropy of any string of unsigned chars of size "length". It is basically a refactoring of the code found at http://rosettacode.org/wiki/Entropy . I use it for a 64-bit IV generator that creates a container of 100000000 IVs with no duplicates and an average entropy of 3.9. http://www.quantifiedtechnologies.com/Programming.html

 #include <string>
 #include <map>
 #include <algorithm>
 #include <cmath>

 typedef unsigned char uint8;

 double Calculate(uint8 * input, int length)
 {
     std::map<char, int> frequencies;
     for (int i = 0; i < length; ++i)
         frequencies[input[i]]++;

     double infocontent = 0;
     for (std::pair<char, int> p : frequencies)
     {
         double freq = static_cast<double>(p.second) / length;
         infocontent += freq * log2(freq);
     }
     infocontent *= -1;
     return infocontent;
 }
+1
Dec 10 '14 at 21:50

Re: I need all of this in order to make guesses about the file's contents (plain text, markup, compressed, or some binary format, ...)

As others have noted (or been confused/distracted by), I think you are actually talking about metric entropy (entropy divided by the length of the message). See more at Entropy (information theory) - Wikipedia .

jitter's comment pointing to Scan data for entropy anomalies is very relevant to your underlying goal. That eventually links to libdisorder (a C library for measuring byte entropy) . This approach would seem to give you much more information to work with, since it shows how the metric entropy varies across different parts of the file. See, for example, a graph of how the entropy of a block of 256 consecutive bytes from a 4 MB jpg image (y axis) changes for different offsets (x axis). At the beginning and at the end the entropy is lower, but it is about 7 bits per byte for most of the file.

[Graph of block entropy vs. offset; source: https://github.com/cyphunk/entropy_examples ]

More interesting are the analysis and similar graphs at Analysis of the byte entropy of a FAT-formatted disk | GL.IB.LY

Statistics like the max, min, mode, and standard deviation of the metric entropy for the whole file and/or its first and last blocks could be very useful as a signature.
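
A minimal sketch of such a block-entropy scan, computing bits per byte for consecutive 256-byte blocks (the block size matches the graph described above; the function name is mine):

 import math

 def block_entropies(data, block_size=256):
     # Shannon entropy (bits per byte) of each consecutive block of the data.
     results = []
     for offset in range(0, len(data), block_size):
         block = data[offset:offset + block_size]
         counts = {}
         for b in block:
             counts[b] = counts.get(b, 0) + 1
         h = -sum(c / len(block) * math.log2(c / len(block)) for c in counts.values())
         results.append((offset, h))
     return results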

This book also seems relevant: Detecting and Recognizing File Masquerades to Protect Email and Data - Springer

+1
Feb 11 '16 at 20:05

Without any additional information, the entropy of a file is (by definition) equal to its size * 8 bits. The entropy of a text file is roughly size * 6.6 bits, given that:

  • each character is equally probable
  • there are 95 printable characters in a byte
  • log(95) / log(2) = 6.6 (a quick check of this appears after the list)
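
A quick numeric check of those bounds (a sketch; 95 is the count of printable ASCII characters):

 import math

 print(math.log2(256))   # 8.0  bits per byte: the upper bound for arbitrary bytes
 print(math.log2(95))    # ~6.57 bits per character for uniformly random printable text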

The entropy of a text file in English is estimated to be approximately 0.6 to 1.3 bits per character (as described here ).

In general, you cannot talk about the entropy of a given file. Entropy is a property of a set of files.

If you need the entropy (or entropy per byte, to be exact), the best way is to compress the file with gzip, bz2, rar, or any other strong compressor, and then divide the compressed size by the uncompressed size. That would be a great estimate of the entropy.

Calculating the entropy byte by byte, as Nick Dandulakis suggests, gives a very poor estimate, because it assumes each byte is independent. In text files, for example, it is much more probable to have a lowercase letter after a letter than a whitespace or punctuation character after a letter, since words are typically longer than 2 characters. So the probability of the next character being in the a-z range is correlated with the value of the previous character. Don't use Nick's rough estimate for any real data; use the gzip compression ratio instead.

0
Dec 31 '14 at 13:02


