How to calculate the distribution (histogram) of a large amount of data in a distributed system?

I am building a metrics reporting system for a fleet of more than 100,000 front-end instances. Each instance records a response time for every request it serves, and I need the response-time distribution per request type across the whole fleet, e.g. [P50, P90, P99, P99.9, ...] for each of [requestType1, requestType2, ..., requestType1000].

Each instance collects response times locally. Over a one-minute window, an instance holds in memory a list of response times for every requestType, e.g. requestType1 - [1, 2, 3, 4, 1, 2], requestType2 - [2, 2, 3, 2, 1], ... What I need to do is process all of this data and produce the final percentiles.
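To make the setup concrete, here is a minimal sketch of what such a per-minute, per-instance collector could look like in Java; the class and method names are my own invention, not from any particular library:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical per-instance collector: raw response times per requestType,
    // drained once a minute by a reporting thread.
    class ResponseTimeCollector {
        private final Map<String, List<Long>> samplesByType = new ConcurrentHashMap<>();

        void record(String requestType, long responseTimeMillis) {
            samplesByType
                .computeIfAbsent(requestType, t -> Collections.synchronizedList(new ArrayList<>()))
                .add(responseTimeMillis);
        }

        // Called every minute: hand the collected lists to the reporting pipeline
        // and start a fresh window. (A real implementation would swap windows
        // atomically instead of copy-then-clear.)
        Map<String, List<Long>> drain() {
            Map<String, List<Long>> snapshot = new ConcurrentHashMap<>(samplesByType);
            samplesByType.clear();
            return snapshot;
        }
    }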

I have tried many designs. My main pain points are the huge volume of data collected for each requestType and the cost of communication between instances. I will explain my current design below, but I would also like to know whether there are better designs, or clever algorithms for building such histograms.

Currently the most promising design is this: each front-end instance sends its data to a random instance in a fleet of mid-tier instances. Each mid-tier instance combines all the data points it receives over a short period, e.g. 5 seconds (it does not have enough memory to hold a longer window). The mid-tier instance then forwards the aggregated data, partitioned by the hash of the requestType, to the back-end instances, so every mid-tier instance sends data for the same request type to the same back-end instance. On the back-end instance I can then use a third-party histogram container (a CodaHale or HdrHistogram histogram) to compute the P50, P90, P99, ... of the incoming data. The reason I need the mid-tier fleet at all is that I want the front-end instances to push their data out immediately, without each of them making 100 separate calls to 100 different back-end instances.
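A rough sketch of the two moving parts described above, assuming the Java HdrHistogram library; the backend host list, histogram bounds and class names are assumptions of mine:

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.HdrHistogram.Histogram;

    // Mid-tier side: every mid-tier instance uses the same hash, so a given
    // requestType always lands on the same backend instance.
    class MidTierRouter {
        private final List<String> backendHosts;

        MidTierRouter(List<String> backendHosts) {
            this.backendHosts = backendHosts;
        }

        String backendFor(String requestType) {
            return backendHosts.get(Math.floorMod(requestType.hashCode(), backendHosts.size()));
        }
    }

    // Backend side: one histogram per requestType; values up to 1 hour (in ms)
    // with 2 significant digits keep the per-type memory footprint small.
    class BackendAggregator {
        private final Map<String, Histogram> byType = new ConcurrentHashMap<>();

        void record(String requestType, long responseTimeMillis) {
            byType.computeIfAbsent(requestType, t -> new Histogram(3_600_000L, 2))
                  .recordValue(responseTimeMillis);
        }

        long percentile(String requestType, double percentile) {
            return byType.get(requestType).getValueAtPercentile(percentile);
        }
    }

Because all data for one request type is routed to a single backend instance, reading percentile("requestType1", 99.0) there gives the fleet-wide P99 for that type.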

The main problems I can see with this design are the relatively high complexity, and that if one back-end instance goes down I lose all the data for some request types. So, on the system-design side, does anyone have better ideas?

The other direction, I think, is to find a good algorithm for combining existing histograms. With the design above the result is 100% accurate, but in fact I can tolerate some error. For example, CodaHale (Dropwizard Metrics) and HdrHistogram histograms do not actually store every data point; they use clever algorithms to produce reasonably accurate results at very low cost, and I could use such a histogram library on the front-end or mid-tier instances. The problem is that although I can cheaply get [P50, P90, P99, ...] from each front-end or mid-tier instance, I cannot find a way to combine them. Since each front-end instance can handle different request types, and the distribution of requests across instances is unknown, simply averaging all the per-instance P50/P90/P99 values would be very inaccurate. Does anyone know how to combine multiple CodaHale or HdrHistogram histograms together, or of an algorithm that can merge several histograms into one?

============================================================

I had a new idea last night. Since P50 and P90 describe the bulk of the data, I think simply taking a weighted average (weighted by sample count) of the P50 and P90 values computed on each mid-tier instance should be good enough. But P99, P99.9 and P99.99 describe the far tail, so an average of the per-subset P99 values may be quite inaccurate.

But assuming the data is distributed fairly randomly across mid-tier instances, I can take the top 5% of data points (the largest values) on each mid-tier instance and send them back to the back end. The union of each mid-tier instance's top 5% is then roughly the top 5% of all data points. And I am fairly confident that the P80 of this 5% slice is close to the P99 of the overall data, the P98 of this slice is close to the P99.9, and the P99.8 of this slice is close to the P99.99.

Hopefully this way I only need to transfer 5% of the total data for the tail percentiles while still getting highly accurate results. What do you think?
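If I understand the idea correctly (keeping only the largest 5% of points), the rank mapping is q_subset = 1 - (1 - q_total) / 0.05, which is pure arithmetic and needs no library; a tiny sketch:

    // Map a percentile of the full data set to the corresponding percentile of the
    // retained top fraction, e.g. keeping the top 5% means fractionKept = 0.05.
    public final class TailPercentileMapping {
        static double percentileInTopFraction(double fullDataPercentile, double fractionKept) {
            double tailFraction = (100.0 - fullDataPercentile) / 100.0; // P99 -> 0.01
            return (1.0 - tailFraction / fractionKept) * 100.0;         // -> 80.0
        }

        public static void main(String[] args) {
            System.out.println(percentileInTopFraction(99.0, 0.05));  // ~ 80.0
            System.out.println(percentileInTopFraction(99.9, 0.05));  // ~ 98.0
            System.out.println(percentileInTopFraction(99.99, 0.05)); // ~ 99.8
        }
    }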

Answer:

System design:

If the calls are expensive, maybe you could stream the data instead? I do not see the real benefit of the middle tier in your description: why would a frontend → midtier call be cheaper than a frontend → backend call?

If you are concerned about data loss, you have two options:

  • send each event to multiple nodes, but then you will need to deduplicate them somehow when processing;
  • write everything to a durable log (Kafka can do the job here; a minimal producer sketch follows after this list).
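A minimal sketch of the durable-log option using the standard Kafka Java client; the broker address, topic name and payload are placeholders of mine:

    import java.nio.charset.StandardCharsets;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ResponseTimePublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-broker:9092"); // placeholder address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("acks", "all"); // wait until the log has durably replicated the record

            try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
                // Keying by requestType sends all samples of one type to one partition,
                // so a single consumer can aggregate that type.
                byte[] payload = "1,2,3,4,1,2".getBytes(StandardCharsets.UTF_8); // e.g. a batch of response times
                producer.send(new ProducerRecord<>("response-times", "requestType1", payload));
            }
        }
    }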

It all depends on the event volume (1/min/frontend or 10k/s/frontend?) and on the distance between the frontends and the backend (same data center, or mobile devices → data center?).

If it is all in the same data center, you could talk to the backend through a durable log, which solves the data-loss problem. If there are many events, you can aggregate them on the frontends and push only the aggregates downstream.
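One way to do that frontend aggregation, sticking with the HdrHistogram library the question already mentions, is to record into a local histogram and ship only its compressed encoding downstream; a sketch (window length and value bounds are my assumptions):

    import java.nio.ByteBuffer;
    import org.HdrHistogram.Histogram;

    // Frontend-side aggregate for one requestType and one reporting window.
    class FrontendAggregate {
        // Track values up to 1 hour (in ms) with 2 significant digits: a few KB of memory.
        private final Histogram histogram = new Histogram(3_600_000L, 2);

        void record(long responseTimeMillis) {
            histogram.recordValue(responseTimeMillis);
        }

        // Encode the whole window's distribution into a compact compressed blob that
        // can be pushed downstream instead of the raw data points.
        byte[] encode() {
            ByteBuffer buffer = ByteBuffer.allocate(histogram.getNeededByteBufferCapacity());
            int length = histogram.encodeIntoCompressedByteBuffer(buffer);
            byte[] bytes = new byte[length];
            buffer.rewind();
            buffer.get(bytes, 0, length);
            return bytes;
        }
    }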

Aggregation:

There are various streaming-quantile algorithms, e.g. q-digest and t-digest. See "Quantiles over Data Streams: An Experimental Study" (Wang, Luo, Yi and Cormode, SIGMOD 2013).
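For example, with Ted Dunning's t-digest reference implementation (the com.tdunning:t-digest artifact) the usage looks roughly like the sketch below; treat the exact method names as an assumption and check them against the library version you use:

    import com.tdunning.math.stats.TDigest;

    public class TDigestSketch {
        public static void main(String[] args) {
            // A compression factor around 100 gives good tail accuracy with a small footprint.
            TDigest nodeA = TDigest.createMergingDigest(100);
            nodeA.add(12.0);
            nodeA.add(250.0);
            nodeA.add(7.5);

            TDigest nodeB = TDigest.createMergingDigest(100);
            nodeB.add(42.0);

            // Digests are mergeable, so each frontend/mid-tier node can keep its own
            // digest and the backend combines them before reading quantiles.
            nodeA.add(nodeB);

            System.out.println("p99 ~= " + nodeA.quantile(0.99));
        }
    }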

It is also worth noting that HdrHistogram instances can be combined (merged) directly.
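Concretely, with the Java HdrHistogram library a backend can decode each sender's compressed histogram and fold it into a global one with add(); a sketch, assuming the senders ship HdrHistogram's compressed byte-buffer encoding:

    import java.nio.ByteBuffer;
    import java.util.List;
    import java.util.zip.DataFormatException;
    import org.HdrHistogram.Histogram;

    class HistogramMerger {
        // Merge the per-instance histograms for one requestType into a fleet-wide one.
        static Histogram merge(List<byte[]> encodedHistograms) throws DataFormatException {
            Histogram merged = new Histogram(3_600_000L, 2);
            for (byte[] bytes : encodedHistograms) {
                Histogram h = Histogram.decodeFromCompressedByteBuffer(
                        ByteBuffer.wrap(bytes), 3_600_000L);
                merged.add(h); // exact merge: bucket counts are summed
            }
            return merged;
        }
    }

Reading merged.getValueAtPercentile(99.0) then gives the true fleet-wide P99 for that request type, rather than an average of per-instance P99 values.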
