Incorrect bytes are sometimes written to disk. Hardware issues?

Question


I wrote a UDP-based transfer protocol using C++11 (VS2013). It is blazing fast - and works fine 99.9% of the time.

But several times I have noticed wrong bytes being written to disk (Samsung 250 GB SSD 850 EVO) - or at least it seems so.

Here is basically what happens when I transfer a 6 GB test file:

The file is split into smaller UDP data packets, 64 KB in size. (The network layer disassembles and reassembles the UDP datagrams into a larger packet.)
The client sends the data packet (UDP) to the server. The payload is encrypted with AES256 (OpenSSL) and contains data + metadata. The payload also contains a SHA256 hash of the entire payload, as an additional integrity check to complement the UDP checksum.
The server receives the data packet, sends an "ACK" packet back to the client, and then calculates the SHA256 hash. The hash is identical to the client's hash - everything is fine.
The server then writes the packet data to disk (using fwrite instead of streams because of the huge performance difference). The server processes only one packet at a time, and each file pointer has a mutex guard that protects it from being closed by another worker thread, which closes file pointers that have been inactive for 10 seconds.
The client receives the UDP "ACK" packets and resends packets that were not ACKed (meaning they did not arrive). The rate of incoming ACK packets controls the client's send rate (i.e. congestion control / throttling). The order in which packets are received on the server does not matter, since each packet contains a Position value (where the data should be written in the file).
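
To illustrate the last point, here is a minimal sketch of how a file could be cut into position-tagged chunks and reassembled in any order. It is not the code from the actual protocol: Chunk, PlanChunks and ApplyChunk are hypothetical names, an in-memory buffer stands in for the file, and encryption, hashing and ACK handling are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical descriptor for one data packet: where its payload belongs
// in the file and how many bytes it carries.
struct Chunk {
    std::uint64_t position;  // byte offset in the target file
    std::uint32_t length;    // payload size in bytes
};

// Split a file of fileSize bytes into chunks of at most chunkSize bytes.
std::vector<Chunk> PlanChunks(std::uint64_t fileSize, std::uint32_t chunkSize) {
    assert(chunkSize > 0);
    std::vector<Chunk> chunks;
    for (std::uint64_t pos = 0; pos < fileSize; pos += chunkSize) {
        const std::uint64_t remaining = fileSize - pos;
        const std::uint32_t len = remaining < chunkSize
                                      ? static_cast<std::uint32_t>(remaining)
                                      : chunkSize;
        chunks.push_back({pos, len});
    }
    return chunks;
}

// Copy a received payload to its offset in a buffer standing in for the file.
// Because the offset travels with the packet, arrival order is irrelevant.
void ApplyChunk(std::vector<unsigned char>& file, const Chunk& c,
                const unsigned char* payload) {
    assert(c.position + c.length <= file.size());
    std::copy(payload, payload + c.length,
              file.begin() + static_cast<std::ptrdiff_t>(c.position));
}
```

With a 6 GiB file and 64 KiB chunks, PlanChunks returns 98,304 entries, which is in the same ballpark as the packet count mentioned further down.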

After the entire file has been transferred, I compute a full SHA256 hash of the 6 GB file on both the server and the client. To my horror, twice in the last few days (over about 20 test transfers) the hashes were NOT the same.

After comparing the files in Beyond Compare, I usually find that there are one or two bits (in a 6 GB file) that are incorrect on the server.

See screenshot below:

Server code - called after checking the DataPackage hash

void WriteToFile(long long position, unsigned char * data, int lengthOfData){
    boost::lock_guard<std::mutex> guard(filePointerMutex);

    //Open if required
    if (filePointer == nullptr){
        _wfopen_s(&filePointer, (U("\\\\?\\") + AbsoluteFilePathAndName).c_str(), L"wb");
    }

    //Seek
    fsetpos(filePointer, &position);

    //Write - not checking the result of the fwrite operation - should I?
    fwrite(data, sizeof(unsigned char), lengthOfData, filePointer);

    //Flush
    fflush(filePointer);

    //A separate worker thread is closing all stale filehandles
    //(and setting filePointer to NULLPTR). This isn't invoked until 10 secs
    //after the file has been transferred anyways - so shouldn't matter
}
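
For comparison, a minimal sketch of the same seek/write/flush sequence with every return value checked. WriteChecked and SEEK64 are made-up names for illustration and are not part of the code above; the 64-bit seek is assumed to be _fseeki64 on Windows and fseeko on POSIX systems. Checking the results only catches errors that the runtime reports; by itself it would not detect a silently flipped bit.

```cpp
#include <cassert>
#include <cstddef>
#include <stdio.h>

#ifdef _WIN32
#define SEEK64(fp, off) _fseeki64((fp), (off), SEEK_SET)
#else
#include <sys/types.h>
#define SEEK64(fp, off) fseeko((fp), static_cast<off_t>(off), SEEK_SET)
#endif

// Returns true only if the seek, the write and the flush all reported success.
bool WriteChecked(FILE* fp, long long position,
                  const unsigned char* data, std::size_t length) {
    assert(data != nullptr || length == 0);
    if (fp == nullptr) return false;                      // file was never opened
    if (SEEK64(fp, position) != 0) return false;          // seek failed
    if (fwrite(data, 1, length, fp) != length) return false;  // short write
    return fflush(fp) == 0;                               // flush failed?
}
```

A caller would typically log or retry when WriteChecked returns false instead of ACKing the packet as stored.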

So to summarize:

The char * was correct in memory on the server - otherwise the server's SHA256 hash check would have failed - right? (A SHA256 hash collision is extremely unlikely.)
The corruption seems to occur when writing to disk. Since about 95,000 of these 64 KB packets are written when sending a 6 GB file, and this happens only once or twice (when it happens at all), it is a rare occurrence.

How can this happen? Is my hardware (bad RAM / drive) to blame for this?

Do I need to read back from disk after writing and do, for example, a memcmp to be 100% sure that the correct bytes were written to disk? (Oh boy, what a performance hit that would be ...)


c++ ram hardware sha256

Njål Arne Gjermundshaug Aug 18 '16 at 8:49


1 answer

Njål Arne Gjermundshaug · Accepted Answer · 2016-10-26T06:52:32+0000

On my local computer, it turned out to be a RAM problem. I found it by running memtest86.

However, I also changed the code of the software that runs on our production servers, making it read back from disk to verify that the correct bytes were actually written. These servers write about 10 TB to disk every day, and a week after the new code went live, an error occurred once. The software fixes this by writing and checking again, but it is still interesting to see that it really happened.
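
A rough sketch of what such a write-then-verify step could look like. This is an illustration only, not the production code referred to above: WriteAndVerify and maxAttempts are invented names, the function writes at the current file position, and it assumes the file was opened for both reading and writing (for example "wb+" or "r+b" rather than "wb").

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Write `length` bytes at the current position, read them back and compare.
// On a mismatch the write is repeated, up to maxAttempts times in total.
// Returns true once the read-back data matches what was written.
bool WriteAndVerify(std::FILE* fp, const unsigned char* data,
                    std::size_t length, int maxAttempts = 3) {
    assert(fp != nullptr && maxAttempts > 0);
    if (length == 0) return true;

    std::fpos_t start;
    if (std::fgetpos(fp, &start) != 0) return false;  // remember where we write

    std::vector<unsigned char> readBack(length);
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        if (std::fsetpos(fp, &start) != 0) return false;
        if (std::fwrite(data, 1, length, fp) != length) return false;
        if (std::fflush(fp) != 0) return false;

        // Reposition before switching from writing to reading.
        if (std::fsetpos(fp, &start) != 0) return false;
        if (std::fread(readBack.data(), 1, length, fp) != length) return false;

        if (std::memcmp(readBack.data(), data, length) == 0) return true;
        // Mismatch: fall through and write the same bytes again.
    }
    return false;
}
```

Keep in mind that a read issued right after a write is usually served from the operating system's file cache, so a check like this mostly validates what the OS holds in memory rather than what physically reached the drive, unless caching is bypassed.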

1 bit out of 560000000000000 bits (about 10 TB per day × 7 days × 8 bits per byte) was not written correctly to disk. Amazing.

I will most likely run memtest86 on this server as well to find out whether this is a RAM problem, but I am not really worried about it, because file integrity is now more or less ensured and the servers otherwise show no signs of hardware problems.

So, if file integrity is extremely important to you (as it is for us), do not trust your hardware 100%, and verify your read/write operations. Anomalies may be an early sign of hardware problems.
