Checking Failover Code

I am currently working on a server application for which we have agreed to maintain a certain level of service. The guarantee is this: once the server accepts a request and sends an acknowledgment to the client, the request will be executed even if the server crashes. Because requests can take a long time and the acknowledgment must be fast, we persist the request, send the acknowledgment to the client, and only then carry out the actions needed to complete the request. Each action is also persisted as it is performed, so on startup the server knows the state of every request. There are also various reconciliation mechanisms with external systems to verify the accuracy of our logs.
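To make the persist-then-acknowledge flow concrete, here is a minimal sketch. The names `RequestJournal`, `accept`, and the record layout are all hypothetical, and a single fsynced append-only file is only an assumption about how the storage might look, not the actual implementation.

```python
import json
import os


class RequestJournal:
    """Append-only journal; each record is fsynced before append() returns."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the record to disk

    def replay(self):
        """Return {request_id: [completed steps]} for requests not yet done."""
        pending = {}
        if not os.path.exists(self.path):
            return pending
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final record from a crash mid-write
                if rec["event"] == "accepted":
                    pending[rec["id"]] = []
                elif rec["event"] == "step":
                    pending[rec["id"]].append(rec["step"])
                elif rec["event"] == "done":
                    pending.pop(rec["id"], None)
        return pending


def accept(journal, request_id, payload):
    # Persist first, acknowledge second: the ACK is only returned once
    # the request record has been written and fsynced.
    journal.append({"event": "accepted", "id": request_id, "payload": payload})
    return "ACK"
```

On startup the server would call `replay()` and resume every request it returns, skipping the steps already recorded.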

This all seems to work pretty well, but it is hard to say so with any certainty, because our failover code is very difficult to test. So far we have come up with two strategies, but neither is completely satisfactory:

  • Have an external process watch the server and try to kill it at what that process considers a suitable point in the test.
  • Add code to the application that makes it crash at certain critical points.

My problem with the first strategy is that the external process cannot know the exact state of the application, so we cannot be sure that we are hitting the most problematic moments in the code. My problem with the second strategy is that, although it gives more control over where the failure happens, I do not like having fault-injection code in my application, even with conditional compilation, etc. I am afraid it would be too easy to overlook a fault-injection point and let it slip into the production environment.
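For illustration, the second strategy could look roughly like the sketch below. The function `crash_point`, the environment variable `CRASH_AT`, and the exit code are hypothetical. `os._exit` ends the process without running cleanup handlers or flushing stdio buffers, which approximates a hard kill but is not identical to one.

```python
import os

ARMED_ENV = "CRASH_AT"  # hypothetical variable naming the point to crash at


def crash_point(name, environ=os.environ, kill=os._exit):
    """Terminate the process abruptly if this crash point is armed.

    Does nothing unless the environment names this exact point, so an
    unarmed crash point is inert.
    """
    if environ.get(ARMED_ENV) == name:
        kill(137)  # 137 is the exit status a shell reports after SIGKILL


def handle_request(journal, request):
    journal.append(request)         # stands in for the durable save
    crash_point("after_persist")    # die between persisting and acknowledging
    return "ACK"
```

A test would start the server with `CRASH_AT=after_persist`, send a request, wait for the process to die, restart it without the variable, and check that the request still completes. The concern about such a hook reaching production remains: the sketch only makes the hook inert by default, it does not remove it.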

+5
4 answers

, , , , factory .

-, kill -9 .

. , , Solaris FreeBSD. zfs , rm .

, .

, , , , - , -, , . , .

+3

, . , , .

/ , .. .

Fault Injection , " ".

+2

. . - . , . . , OS'es , , . , root, " ". , , , , .

, , . , . , .

+2

, :)

, , ( , ...). , , , .

It also seems feasible to verify that no test code makes it into production. I would discourage conditional compilation, though, and would rather use a configuration file to select the logging component.
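One way to read this suggestion: the component that records each step is chosen from a configuration file, and only the test configuration selects a variant that fails on purpose. The class names, the `[recorder]` section, and its keys below are all hypothetical, and the crash is simulated with an exception rather than a real process kill.

```python
import configparser


class SimulatedCrash(Exception):
    """Stands in for the process dying at a chosen step."""


class StepRecorder:
    """Production variant: just records each completed step."""

    def __init__(self):
        self.steps = []

    def record(self, step):
        self.steps.append(step)


class CrashingRecorder(StepRecorder):
    """Test variant: records the step, then fails right after a chosen one."""

    def __init__(self, crash_after):
        super().__init__()
        self.crash_after = crash_after

    def record(self, step):
        super().record(step)
        if step == self.crash_after:
            raise SimulatedCrash(step)


def recorder_from_config(text):
    """Build the recorder named in the (hypothetical) [recorder] section."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    section = cfg["recorder"]
    if section.get("kind", "plain") == "crashing":
        return CrashingRecorder(section["crash_after"])
    return StepRecorder()
```

Checking that production is clean then reduces to checking that the deployed configuration never contains `kind = crashing`.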

The use of "random" kills can help detect bugs, but is not suitable for systematic testing because of its non-determinism. Therefore, I would not use it for automatic testing.
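A deterministic alternative to random kills is to enumerate every crash point and run one crash-and-recover cycle per point. The sketch below does this in-process with a simulated crash. The step names and the in-memory `log` are placeholders for the real actions and the durable log.

```python
class SimulatedCrash(Exception):
    """Stands in for the server dying right after a step is recorded."""


STEPS = ["reserve", "charge", "notify"]  # hypothetical request steps


def run(log, crash_after=None):
    """Perform every step not yet in the log, recording each one.

    If crash_after names a step, fail immediately after recording it.
    """
    for step in STEPS:
        if step in log:
            continue  # already done before the crash; do not repeat it
        log.append(step)  # stands in for the durable save
        if step == crash_after:
            raise SimulatedCrash(step)


def check_all_crash_points():
    """Crash once after each step, restart, and report whether recovery
    finished the request with every step performed exactly once."""
    results = {}
    for crash_after in STEPS:
        log = []
        try:
            run(log, crash_after)
        except SimulatedCrash:
            pass
        run(log)  # the restart: resume from whatever the log contains
        results[crash_after] = (log == STEPS)
    return results
```

Because the crash points are enumerated rather than sampled, a failing case can be rerun exactly, which is what makes this usable in an automated suite.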

+1