Is this the right way to find the checksum?

Question

I am trying to calculate a checksum for some data. This is the code:

    #include <stdio.h>
    #include <string.h>

    int main()
    {
        char MyArray[] = "my secret data";
        char checksum = 0;
        int SizeOfArray = strlen(MyArray);

        for(int x = 0; x < SizeOfArray; x++)
        {
            checksum += MyArray[x];
        }

        printf("Sum of the bytes for MyArray is: %d\n", checksum);
        printf("The checksum: \n");
        checksum = (checksum ^ 0xFF);
        printf("%d\n",checksum);
    }

Output:

    Sum of the bytes for MyArray is: 70
    The checksum:
    -71

Modification in the code:

    #include <stdio.h>
    #include <string.h>

    int main()
    {
        char MyArray[] = "my secret data";
        char checksum = 0; // could be an int if preferred
        int SizeOfArray = strlen(MyArray);

        for(int x = 0; x < SizeOfArray; x++)
        {
            checksum += MyArray[x];
        }

        printf("Sum of the bytes for MyArray is: %d\n", checksum);

        //Perform bitwise inversion
        checksum=~checksum;
        //Increment
        checksum++;

        printf("Checksum for MyArray is: %d\n", checksum);
    }

Output:

    Sum of the bytes for MyArray is: 70
    Checksum for MyArray is: -70

Why did the value of the checksum change? Will different algorithms produce different checksums?

How is the final value useful? To be honest, I don't fully understand checksums and how they are used in data verification. I searched the web and found many articles, but it is still not clear to me. Hopefully today I will understand the checksum.

+8

c checksum

highlander141 Dec 25 '15 at 6:19

3 answers

You need to understand what a checksum is before you think about how to create one. Suppose you are sending data over an unreliable communication channel, such as a network connection. You need to make sure that no interference has altered your message.

One way to do this is to send the message twice and check for differences (the chance that the same error occurs in both transmissions is fairly small). However, this doubles the bandwidth required.

A more efficient approach is to compute a value from the message and attach it to the message. The recipient then applies the same function and checks whether the value matches.

For a more intuitive example, the checksum of a book could be its number of pages. You get a book from the library and count its pages. If the page count is not what you expected, something is wrong.

You have implemented one particular checksum function (an LSB sum, that is, the sum of the bytes truncated to its least significant byte), which is fine. All checksum functions have properties that you should be aware of, but the fact is that there is no single correct way to calculate a checksum. Many functions can be used for this purpose.

+7

Paul92 Dec 25 '15 at 6:30

A checksum is usually used to detect changes in data. Communications, encryption, signatures, and so on: checksums are used everywhere.

What can a checksum detect?

it detects a change of 1 bit, for example
it can even detect changes of more than 1 bit.

When only 1 bit changes, your checksum will always catch it. With several changes, however, it can fail. Take

 (A) checksum += 0x11 instead of 0x10

and later

 (B) checksum += 0x30 instead of 0x31

In (A) the sum ends up 1 too high, and in (B) 1 too low. Plus 1 and minus 1 add up to 0, so the two errors together will not be detected by your checksum.

Basically, the quality of a checksum depends on

the length of the checksum: the larger the checksum, the more data it can cover without "wrapping around" (one byte can hold only 256 distinct checksums, 2 bytes can hold 65536; note that in the case above, with your algorithm, a longer checksum would not change the result)
the quality of the checksum calculation, which should prevent, as far as possible, two differences from cancelling each other out.

There are many algorithms available. This answer on SO is a good place to start.

+3

Ring Ø Dec 25 '15 at 6:30

Ian · Accepted Answer · 2015-12-25T15:05:27+0000

About the checksum

A checksum is usually used to verify data integrity, especially over a noisy / unreliable communication channel. Thus, it is mainly used for error detection, that is, to know whether the data received is correct or not.

This is very different from, for example, error correction. Since error correction is used not only to detect an error but also to fix it, the error-correction data usually grows roughly in proportion to the original data (the more data you have, the more overhead you need to recover it).

Thus, in this sense, a good checksum algorithm is usually one that detects errors with the least amount of overhead, yet with high immunity to false results.

And here lies the problem: the reliability of a checksum depends not only on the algorithm, but also on the characteristics of the channel. Some channels are prone to certain types of errors, while others are prone to different ones. In general, some checksums are known to be more reliable and more popular than others (one of my favorites is CRC, the cyclic redundancy check). But no checksum is ideal for every scenario; it really depends on the usage.

Still, you can measure the reliability of a checksum algorithm. There is a mathematical way to do this, which I think is beyond the scope of this discussion. So in this sense, some checksums can be said to be weaker than others. The checksums you showed in your question are among the weak ones.

About the code

For 8-bit values, XOR with 0xFF is exactly equivalent to bitwise inversion, and it is not hard to see why.

XOR with 0xFF

    1110 0010
    1111 1111
    --------- XOR
    0001 1101  //notice that this is exactly the same as binary inverting!

Thus, whether you XOR with 0xFF or use ~checksum, you get the same result, -71 (it is negative because your data type is char, which is signed on your platform). In the second version you then add 1, so you get -70.

About 2's complement

Two's complement is a mathematical operation on binary numbers, as well as a binary signed number representation based on this operation. Its wide use in computing makes it the most important example of a radix complement. (Wikipedia)

In other words, taking the 2's complement finds the negative representation of a value (in computer binary), and the method, as you did correctly, is to invert all its bits and then add one. That is why you get -70 when you take the 2's complement of 70. But this does not mean that 2's complement and XOR with 0xFF are the same, and as you can see from your example, they are indeed not the same.

What XOR with 0xFF does on 8-bit data is simply invert all its bits. It does not add one to the result.

About how to add / read the checksum

Since a checksum is used to determine the integrity of the data (whether it has been modified or not), people have tried to find the best practices for doing so. What you did is actually compute a checksum either by 2's complement or by XOR with 0xFF.

And here is how they are used:

For a 2's complement checksum: let's say the length of your message is N. You sum the N numbers and get, say, 70. Then you append the 2's complement checksum (that is, -70). On the receiver side, you just need to sum all N + 1 bytes, including the checksum, and you should get 0 if the message has not changed. That is how to use a 2's complement checksum correctly.
For XOR with 0xFF: again, with the same example, you should get -1 if you sum all N + 1 bytes, including the checksum. Since the hexadecimal representation of -1 as a signed 8-bit value is 0xFF, XORing the result (-1) with 0xFF should give 0xFF ^ 0xFF = 0 if the message contains no error.

Therefore, in both cases, to check whether the message contains an error you just need to check whether the final result is 0 (no error) or not! And this is usually true for checksum algorithms!

This is the beauty of a checksum algorithm: the way you create the checksum and the way you check it are somewhat symmetrical!
