Calculation of the probability of a system failure in a distributed network

I am trying to build a mathematical model of file availability in a distributed file system. I posted this question on MathOverflow, but it can also be classified as a CS question, so I am giving it a chance here as well.

The system works as follows: a node stores a file (encoded using erasure codes) on r * b remote nodes, where r is the replication factor and b is an integer constant. Erasure-encoded files have the property that the file can be restored if at least b of the remote nodes are available and return their part of the file.

The simplest approach is to assume that all remote nodes are independent of each other and have the same availability p. Under these assumptions, file availability follows a binomial distribution:

    avail = sum_{i=b..r*b} C(r*b, i) * p^i * (1-p)^(r*b-i)
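For reference, a minimal sketch of this computation (the function and parameter names are just illustrative):

    from math import comb

    def binomial_availability(p, r, b):
        """Probability that at least b of the r*b remote nodes are up,
        assuming independent nodes with identical availability p."""
        n = r * b
        return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
                   for i in range(b, n + 1))

    # Example: r = 3, b = 4 (12 remote nodes), per-node availability 0.5
    print(binomial_availability(0.5, 3, 4))  # 0.927...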

Unfortunately, these two assumptions can introduce a significant error, as shown in this article: http://deim.urv.cat/~lluis.pamies/uploads/Main/icpp09-paper.pdf .

One way to overcome the assumption that all nodes have the same availability is to calculate the probability of each possible combination of available/unavailable nodes and take the sum of all these outcomes (which is sort of what the above article suggests, just more formally than I describe here). You can view this approach as a binary tree of depth r*b, where each leaf is one possible combination of available/unavailable nodes. File availability is then the probability of ending up in a leaf with at least b available nodes. This approach is more correct, but has a computational cost of O(2^(r*b)). Also, it does not address the node-independence assumption.
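For illustration, a brute-force sketch of that enumeration (the parameter names are mine):

    from itertools import product

    def exact_availability(p, b):
        """Exact file availability with per-node availabilities p[0..n-1]:
        enumerate all 2^n alive/dead combinations and sum the probability
        of every combination with at least b alive nodes. Cost is O(2^n)."""
        n = len(p)
        total = 0.0
        for alive in product([True, False], repeat=n):
            prob = 1.0
            for j in range(n):
                prob *= p[j] if alive[j] else (1 - p[j])
            if sum(alive) >= b:
                total += prob
        return total

    # Example with 6 nodes (r = 2, b = 3) and distinct availabilities
    print(exact_availability([0.9, 0.8, 0.95, 0.7, 0.85, 0.9], 3))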

Do you have any ideas for a good approximation that introduces less error than the binomial-distribution approximation, but with a better computational cost than the exact O(2^(r*b)) method?

You can assume that the availability data of each node is a set of tuples of the form (measurement-date, measuring node, measured node, success/failure-bit). Using this data you could, for example, calculate the availability correlation between nodes and the availability variance.
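For example, a sketch of how per-node availability could be estimated from such tuples (the field order follows the tuple above; everything else is a naming choice of mine):

    from collections import defaultdict

    def node_availability(measurements):
        """Estimate each node's availability as its fraction of successful
        probes, from (date, measuring_node, measured_node, success) tuples."""
        hits = defaultdict(int)
        probes = defaultdict(int)
        for date, measurer, measured, success in measurements:
            probes[measured] += 1
            if success:
                hits[measured] += 1
        return {node: hits[node] / probes[node] for node in probes}

    # Example with two probes of node "n1" and one of "n2"
    data = [("2010-05-01", "n0", "n1", True),
            ("2010-05-02", "n0", "n1", False),
            ("2010-05-02", "n1", "n2", True)]
    print(node_availability(data))  # {'n1': 0.5, 'n2': 1.0}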

+7
computer-science time-complexity distributed high-availability
2 answers

For large r and b you can use a method called Monte Carlo integration (see, for example, Monte Carlo integration on Wikipedia and/or SICP chapter 3.1.2) to calculate the sum. For small r and b and significantly different per-node probabilities p[i], the exact method is superior. The exact definitions of small and large depend on several factors and are best tried out experimentally.

Concrete code example: here is a very simple code example (in Python) to demonstrate how such a procedure can work:

    import random

    def montecarlo(p, rb, b, N):
        """Monte Carlo estimate corresponding to the binomial formula."""
        succ = 0
        # Run N samples
        for i in range(N):
            # Generate a single test case
            alivenum = 0
            for j in range(rb):
                if random.random() < p:
                    alivenum += 1
            # If the test case succeeds, increase succ
            if alivenum >= b:
                succ += 1
        # The final result is the number of successful cases divided by the
        # total number of cases (i.e., a probability between 0 and 1)
        return float(succ) / N

The function corresponds to the binomial test case above and runs N trials, each checking whether at least b out of r*b nodes are alive when each node is up with probability p. A few experiments will convince you that you need N in the range of thousands of samples before you get reasonable results, but in principle the complexity is O(N*r*b). The accuracy of the result scales as sqrt(N), i.e., to double the accuracy you need to quadruple N. For sufficiently large r*b this method will be clearly superior.
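For instance, a quick sanity check of the estimator against the exact binomial sum (arbitrary parameter values, just to show the convergence in N):

    from math import comb

    def exact(p, rb, b):
        return sum(comb(rb, i) * p ** i * (1 - p) ** (rb - i)
                   for i in range(b, rb + 1))

    print(exact(0.9, 12, 4))                 # exact reference value
    for N in (100, 10000, 1000000):
        print(N, montecarlo(0.9, 12, 4, N))  # estimates approach it as N grows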

Extensions of the approximation: you obviously need to design the test case so that it takes all the properties of the system into account. You have suggested a couple of extensions, some of which can be implemented easily, while others cannot. Let me give you a couple of suggestions:

1) In the case of distinct but uncorrelated p[i], the changes to the above code are minimal: in the function head you pass an array instead of a single float p, and you replace the line if random.random()<p: alivenum += 1 with

 if random.random()<p[j]: alivenum += 1 
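Spelled out, the modified function might look like this (a sketch; same structure as the code above, just with a per-node array p):

    import random

    def montecarlo_distinct(p, b, N):
        """Monte Carlo estimate with a distinct, uncorrelated availability
        p[j] for each of the len(p) == r*b nodes."""
        succ = 0
        for i in range(N):
            alivenum = 0
            for j in range(len(p)):
                if random.random() < p[j]:
                    alivenum += 1
            if alivenum >= b:
                succ += 1
        return float(succ) / N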

2) In the case of correlated p[i] you need additional information about the system. The situation I referred to in my comment could be a network like this:

    A--B--C
    |  |
    D  E
    |
    F--G--H
    |
    J

In this case A might be the "root node", and a failure of node D could imply an automatic failure, with 100% probability, of nodes F, G, H and J; while a failure of node F would automatically bring down G, H and J, and so on. At least that is how I meant it in my comment (which is a plausible interpretation, since you speak of a tree-like probability structure in the original question). In this situation you would modify the code so that p refers to a tree structure and for j in ... traverses the tree, skipping the lower branches of the current node as soon as a test fails. The resulting test is still whether alivenum >= b, as before.
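A minimal sketch of that traversal, with the tree represented as nested (p, children) tuples; both the representation and the probability values are illustrative choices of mine:

    import random

    def count_alive(tree):
        """Count alive nodes in a subtree given as a (p, children) tuple.
        If a node fails, its entire subtree is skipped (all counted dead)."""
        p, children = tree
        if random.random() >= p:
            return 0
        return 1 + sum(count_alive(c) for c in children)

    def montecarlo_tree(tree, b, N):
        """MC availability estimate under correlated, tree-shaped failures."""
        succ = sum(1 for _ in range(N) if count_alive(tree) >= b)
        return float(succ) / N

    # The network above as a tree rooted at A; probabilities are made up.
    # children(A) = {B, D}, children(B) = {C, E},
    # children(D) = {F},   children(F) = {G, H, J}
    tree = (0.99, [(0.95, [(0.90, []), (0.90, [])]),
                   (0.95, [(0.90, [(0.90, []), (0.90, []), (0.85, [])])])])
    print(montecarlo_tree(tree, 5, 100000))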

3) This approach fails if the network is a cyclic graph that cannot be represented by a tree structure. In that case you first generate the graph nodes as either dead or alive, and afterwards run a routing algorithm on the graph to count the number of unique reachable nodes. This will not increase the time complexity of the algorithm, but obviously the code complexity.
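One possible shape of such a test case, assuming an adjacency-list graph and a designated root from which nodes must be reachable (both assumptions of mine):

    import random
    from collections import deque

    def alive_reachable(adj, p, root):
        """One MC test case for a (possibly cyclic) graph: mark each node
        dead/alive independently, then BFS from the root over alive nodes
        and count the unique reachable ones."""
        alive = {v: random.random() < p[v] for v in adj}
        if not alive[root]:
            return 0
        seen, queue = {root}, deque([root])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if alive[w] and w not in seen:
                    seen.add(w)
                    queue.append(w)
        return len(seen)

    # Cycle A-B-C-A plus a spur D
    adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
    p = {"A": 0.99, "B": 0.9, "C": 0.9, "D": 0.8}
    print(alive_reachable(adj, p, "A"))  # size of the alive component at A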

4) Time dependency is a non-trivial but possible modification if you know the MTBF/MTTR (mean time between failures / mean time to repair) of the nodes, since these can give you the probabilities p, of either the tree structure or the uncorrelated linear p[i], as a sum of exponentials. You then have to run the MC procedure at different times with the corresponding values of p.
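For a single node this "sum of exponentials" has a standard closed form in the two-state Markov model (a sketch under that modeling assumption; the answer does not commit to a specific model):

    from math import exp

    def availability_at(t, mtbf, mttr):
        """Point availability of a node that is up at t = 0, in the
        two-state Markov model: a steady-state term plus a decaying
        exponential (the 'sum of exponentials' mentioned above)."""
        lam, mu = 1.0 / mtbf, 1.0 / mttr
        steady = mu / (lam + mu)            # long-run availability
        return steady + (lam / (lam + mu)) * exp(-(lam + mu) * t)

    # Feed time-dependent probabilities into the MC run at each time t:
    # p_t = [availability_at(t, mtbf[i], mttr[i]) for i in range(rb)]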

5) If you only have the log files (as indicated in the last paragraph), this will require a substantial modification of the approach, which is beyond what I can do on this board. The log files would need to be thorough enough to allow reconstructing a model of the network graph (and thereby the graph of p), as well as the individual values of p for all nodes. Otherwise the accuracy would be unreliable. These log files would also need to be significantly longer than the timescales of failures and repairs, an assumption that may be unrealistic on real networks.

+5

Assuming each node has a constant, known and independent availability rate, a divide-and-conquer approach comes to mind.

Say you have N nodes.

  • Divide them into two sets of N/2 nodes.
  • For each half, calculate the probability that any given number of nodes ( [0,N/2] ) is down.
  • Multiply and sum these as needed to find the probability that any given number ( [0,N] ) of the full set is down.

Step 2 can be done recursively, and at the top level you can sum up as needed to find how often more than some given number are down.

I do not know the complexity of this, but if I had to guess, I would say it is at or below O(n^2 log n).


The mechanics of this can be illustrated on a terminal case. Say we have 5 nodes with up-times p1 ... p5. We can split them into segment A with p1 ... p2 and segment B with p3 ... p5. We then process these to find the "N nodes up" probabilities for each segment:

For A:

    a_2 = p1*p2
    a_1 = p1*(1-p2) + (1-p1)*p2
    a_0 = (1-p1)*(1-p2)

For B:

    b_3 = p3*p4*p5
    b_2 = p3*p4*(1-p5) + p3*(1-p4)*p5 + (1-p3)*p4*p5

and similarly for b_1 and b_0.

The final results for this stage can be found by multiplying each of the a terms by each of the b terms and summing accordingly:

    v[0] = a[0]*b[0]
    v[1] = a[1]*b[0] + a[0]*b[1]
    v[2] = a[2]*b[0] + a[1]*b[1] + a[0]*b[2]
    v[3] = a[2]*b[1] + a[1]*b[2] + a[0]*b[3]
    v[4] = a[2]*b[2] + a[1]*b[3]
    v[5] = a[2]*b[3]
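A compact sketch of the whole scheme (names and the recursive split are my own choices; the inner loops are exactly the v[...] convolution above):

    def up_distribution(p):
        """Return dist where dist[k] is the probability that exactly k of
        the nodes with availabilities p are up (divide and conquer)."""
        if len(p) == 1:
            return [1 - p[0], p[0]]
        a = up_distribution(p[:len(p) // 2])
        b = up_distribution(p[len(p) // 2:])
        # Convolve the two halves, exactly like the v[...] table above
        v = [0.0] * (len(a) + len(b) - 1)
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                v[i + j] += ai * bj
        return v

    def availability(p, b_min):
        """Probability that at least b_min nodes are up."""
        return sum(up_distribution(p)[b_min:])

    # Example: five nodes with distinct up-times
    print(availability([0.9, 0.8, 0.95, 0.7, 0.85], 3))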
+2
