I once worked for a company that sold a client-server application, which was mainly a means of transferring and synchronizing files. Both the client and server were custom applications that we developed.
We had a persistent bug that was very difficult to reproduce in the lab. Our server could handle only a certain number of incoming client connections per machine, so many of our customers would "cluster" several servers together to handle large user populations. The end-user data for the cluster lived on a file server that they all shared. In this clustered configuration, a failure would show up under load: we would get a low-level file system error code when making a file-sharing call against one of the end-user data files. Nobody could get it to reproduce reliably in the lab, and even when they could, they couldn't narrow down what was happening.
(I have forgotten the exact error; it may have been 59 ERROR_UNEXP_NET_ERR or 65 ERROR_NETWORK_ACCESS_DENIED. As far as I remember, it was not even one of the documented error codes for the API we were calling, which was usually a lock or unlock of a section of a file.)
Since it involved the connection between the server and the file storage, and I was the "network transport" guy, I was asked to take a look at it. Many others had already looked at it with no luck.
All I knew was where in the code the error was detected, but not what to do about it. So I needed to find the root cause. I set up a suitable hardware environment to reproduce it, and I deployed a custom build of the server software that instrumented the section of code in question.
The instrumentation was as follows: I added a check for the offending error code and had it call a piece of code that sent a UDP packet to a predefined network address whenever the error occurred. The UDP packet contained a unique string.
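To make the idea concrete, here is a minimal Python sketch of that kind of instrumentation. It is not the original code (which ran inside our Windows server); the address, port, marker string, and function names are all made up for illustration, and the two error codes are just the candidates I mentioned above.

```python
import socket

# Hypothetical values: the real address and marker string are not recorded.
BEACON_ADDR = ("127.0.0.1", 50999)
BEACON_MARKER = b"LOCK-FAILURE-BEACON"
# Candidate codes from memory: ERROR_UNEXP_NET_ERR, ERROR_NETWORK_ACCESS_DENIED
SUSPECT_CODES = {59, 65}


def send_beacon(error_code, addr=BEACON_ADDR):
    """Send one UDP datagram carrying the marker string and the error code."""
    payload = BEACON_MARKER + b" code=" + str(error_code).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, addr)
    return payload


def check_lock_result(error_code, addr=BEACON_ADDR):
    """Call right after the lock/unlock API returns.

    Sends the beacon only for the suspect codes and reports whether it did.
    """
    if error_code in SUSPECT_CODES:
        send_beacon(error_code, addr)
        return True
    return False
```

UDP is a good fit here because the send is fire-and-forget: it needs no listener on the other end and adds almost nothing to the code path being debugged.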
Then I set up a packet sniffer on the network. (At that time I used Microsoft Network Monitor.) I positioned it where it could “see” this UDP packet when it was sent, as well as all of the traffic between the cluster servers and the file server.
Most good sniffers have a mode in which they capture until they see a particular piece of traffic and then stop. I turned on this mode and set it to look for the UDP packet that my code would send. The goal was to end up with a capture of all the file server traffic from right before the error occurred. The last few network packets to and from the system that sent the UDP packet were likely to be a big clue to what was happening.
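The stop-on-trigger mode boils down to a ring buffer plus a pattern match. The following Python sketch simulates it over an in-memory sequence of packet payloads; it is not how Network Monitor is implemented, and `capture_until_trigger` and its parameters are names I made up for the example.

```python
from collections import deque


def capture_until_trigger(packets, marker, buffer_size=1000):
    """Keep only the most recent packets and stop once one contains marker.

    packets: any iterable of bytes payloads (a stand-in for live traffic).
    Returns (captured_packets, triggered).
    """
    ring = deque(maxlen=buffer_size)  # old packets fall off the front
    for pkt in packets:
        ring.append(pkt)
        if marker in pkt:
            # The trigger packet is last; everything before it is the
            # traffic leading up to the error.
            return list(ring), True
    return list(ring), False
```

Because the buffer is bounded, the capture can run all weekend without filling the disk, and what is left when it stops is exactly the traffic just before the trigger.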
I set up the stress test configuration and went home for the weekend.
When I came back on Monday, there was my data. The sniffer had stopped as expected, after many hours of running, and had kept the capture. After studying the capture, I found that the Server Message Block or SMB (aka CIFS aka SAMBA) connection between our server and the file server was actually being dropped at the TCP level because of excessive load on the server. Since all of that Microsoft stuff is heavily layered, the failure percolated back up through the file-sharing stack as an “unexpected” error instead of a more understandable error code that said “hey, you lost the connection at the TCP level”.
I did a little research on Windows TCP settings and found that the defaults for the version of Windows we were using (probably NT 4 in that era) were not very generous. They tolerated only a very small number of failures on a TCP connection and then, boom, you were dead. Once you lost your connection to the SMB file server, all of your file locks were toast and could not be recovered.
So I ended up writing an appendix for the user guide that explained how to change the TCP settings in Windows to make the cluster servers more tolerant of high-load situations. And that was it. The bug fix required zero code changes, just additional documentation on how to properly configure the OS for use with the product.
What did we learn?
- Be prepared to run modified versions of your code to investigate the problem.
- Consider using non-traditional tools to solve the problem (sniffers).
- Not all bug fixes require code changes.
- Sometimes you can diagnose a bug while you are at home having a beer.