I'm two years late, so please consider this answer despite it having only a few votes.
Short answer: use my 1st and 3rd bold equations below to get what most people are thinking of when they say "entropy" of a file in bits. Use just the 1st equation if you want Shannon's H entropy, which is actually an entropy per symbol, as he stated 13 times in his paper, which most people are not aware of. Some online entropy calculators use this one, but Shannon's H is "specific entropy", not "total entropy", which has caused a lot of confusion. Use the 1st and 2nd equations if you want an answer between 0 and 1, which is normalized entropy per symbol (it is not bits/symbol, but a true statistical measure of the "entropic nature" of the data, letting the data choose its own log base instead of arbitrarily assigning 2, e, or 10).
There are 4 types of entropy for a file (data) of N symbols with n unique symbol types. But keep in mind that by knowing the contents of a file, you know the state it is in, and therefore S = 0. To be precise, if you have a source that generates a lot of data you have access to, then you can calculate the expected future entropy/character of that source. If you apply the following to a single file, it is more accurate to say it estimates the expected entropy of other files from that source.
- Shannon (specific) entropy **H = -1 * sum(count_i / N * log(count_i / N))**
  where count_i is the number of times symbol i occurred in N.
  Units are bits/symbol if log is base 2, nats/symbol if natural log.
- Normalized specific entropy: **H / log(n)**
  Units are entropy/symbol. Ranges from 0 to 1. A value of 1 means each symbol occurred equally often; near 0 means all symbols except one occurred only once, and the rest of a very long file was the other symbol. The log is in the same base as for H.
- Absolute entropy **S = N * H**
  Units are bits if log is base 2, nats if ln().
- Normalized absolute entropy **S = N * H / log(n)**
  Units are "entropy". Ranges from 0 to N. The log is in the same base as for H.
Although the last one is the truest "entropy", the first one (Shannon's entropy H) is what all books call "entropy" without (the needed, IMHO) qualification. Most do not clarify (as Shannon did) that it is bits/symbol or entropy per symbol. Calling H "entropy" is speaking too loosely.
For files with equal frequency of each symbol: S = N * H = N. This is the case for most large files of bits. Entropy does not do any compression on the data and is therefore completely ignorant of any patterns, so 000000111111 has the same H and S as 010111101000 (six 1's and six 0's in both cases).
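A quick self-contained check of that claim (the helper name `H_bits_per_symbol` is mine):

```python
from collections import Counter
from math import log

def H_bits_per_symbol(s: str) -> float:
    # Shannon H in bits/symbol for a character string.
    N = len(s)
    return -sum(c / N * log(c / N, 2) for c in Counter(s).values())

for s in ("000000111111", "010111101000"):
    # Both have six 1's and six 0's, so the symbol counts are identical:
    print(s, H_bits_per_symbol(s), len(s) * H_bits_per_symbol(s))  # H = 1.0, S = 12.0
```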
Like others have said, using a standard compression routine like gzip and dividing the size before by the size after will give a better measure of the amount of pre-existing "order" in the file, but it is biased against data that fits the compression scheme better. There is no perfectly optimized general-purpose compressor that we could use to define an absolute "order".
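For what it's worth, here is a rough sketch of that gzip before/after comparison (the ratio is only a relative indicator, for the reason just given):

```python
import gzip
import os

def compression_ratio(data: bytes) -> float:
    # Compressed size divided by original size; smaller means more
    # "order" that gzip's particular model was able to detect.
    return len(gzip.compress(data)) / len(data)

print(compression_ratio(b"000000111111" * 1000))   # repetitive data -> ratio near 0
print(compression_ratio(os.urandom(12000)))        # random data -> ratio near (or above) 1
```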
Another thing to consider: H changes if you change how you express the data. H will be different if you choose different groupings of bits (bits, nibbles, bytes, or hex). So you divide by log(n), where n is the number of unique symbols in the data (2 for binary, 256 for bytes), and H will range from 0 to 1 (this is normalized intensive Shannon entropy in units of entropy per symbol). But technically, if only 100 of the 256 byte types occur, then n = 100, not 256.
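To see that dependence on grouping, here is a small sketch (again only an illustration) that computes the normalized H of the same data read as bytes and as individual bits; the two results generally differ:

```python
from collections import Counter
from math import log

def normalized_H(symbols):
    # Normalized specific entropy: H / log(n), with n = unique symbols present.
    # The log base cancels in the ratio, so natural log is fine here.
    N = len(symbols)
    counts = Counter(symbols)
    n = len(counts)
    H = -sum(c / N * log(c / N) for c in counts.values())
    return H / log(n) if n > 1 else 0.0

data = b"hello world " * 100                              # arbitrary sample data
as_bytes = list(data)                                     # one symbol per byte
as_bits = [(b >> i) & 1 for b in data for i in range(8)]  # one symbol per bit
print(normalized_H(as_bytes), normalized_H(as_bits))      # different values
```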
H is the "intense" entropy, i.e. it corresponds to a symbol that is similar to specific entropy in physics, which is entropy per kg or per mole. A regular "extensive" file entropy, similar to physics, S is S = N * H, where N is the number of characters in the file. H would be exactly the same as part of the ideal volume of gas. Information entropy cannot simply be made exactly equal to physical entropy in a deeper sense, because physical entropy allows for “ordered” as well as disordered mechanisms: physical entropy comes out more than completely random entropy (for example, a compressed file). One aspect is different For an ideal gas there is an additional 5/2 factor for this: S = k * N * (H + 5/2), where H = possible quantum states per molecule = (xp) ^ 3 / hbar * 2 * sigma ^ 2 where x = field width, p = total undirected momentum in the system (calculated from kinetic energy and mass per molecule) and sigma = 0.341 in accordance with the uncertainty principle, giving only the number of possible states within the 1st deck.
A little math gives a shorter form of normalized extensive entropy for a file:
**S = N * H / log(n) = sum(count_i * log(N / count_i)) / log(n)**
The units of this are "entropy" (which is not really a unit). It is normalized to be a better universal measure than the "entropy" units of N * H. But it also should not be called "entropy" without clarification, because the normal historical convention is to erroneously call H "entropy" (which is contrary to the clarifications made in Shannon's own text).
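As a quick sanity check on that identity, a short sketch computing both sides for the same arbitrary byte counts:

```python
from collections import Counter
from math import log, isclose

data = b"any example data will do here"   # arbitrary sample bytes
N = len(data)
counts = Counter(data).values()
n = len(counts)

H = -sum(c / N * log(c / N) for c in counts)
lhs = N * H / log(n)                                # N * H / log(n)
rhs = sum(c * log(N / c) for c in counts) / log(n)  # shorter form
print(lhs, rhs, isclose(lhs, rhs))                  # equal up to rounding
```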