The task is simple: I need to find the right strategy for implementing accurate HyperLogLog unions based on their Redis representation - this includes handling their sparse/dense representations if the data structure is exported for use elsewhere.
Two strategies
There are two strategies, one of which seems a lot easier. I've looked at the actual Redis source, and I'm having some trouble (I'm not very strong in C) figuring out whether it's better to use its built-in structures/routines or to develop my own, in terms of accuracy and efficiency. For what it's worth, I'm willing to sacrifice space and, to some extent, error (standard error of ±2%) in the pursuit of efficiency with extremely large sets.
1. The principle of inclusion
The simpler of the two - essentially, I would just use the lossless union (PFMERGE) in combination with this principle to compute an estimate of the overlap. Tests show this works reliably in many cases, but I'm having trouble getting a good handle on its in-the-wild efficiency and accuracy (some cases can produce 20-40% error, which is unacceptable in this use case).
Basically:
intersectionCardinality = aCardinality + bCardinality - unionCardinality
or, in the case of three sets:
intersectionCardinality = aCardinality + bCardinality + cCardinality - abUnionCardinality - acUnionCardinality - bcUnionCardinality + abcUnionCardinality
(and, for more sets, the full inclusion-exclusion expansion)
seems to work with good accuracy in many cases, but I don't know if I trust it. While Redis has a number of built-in low-cardinality correction modifiers designed to work around known HLL problems, I don't know whether the problem of wild inaccuracies (using inclusion/exclusion) remains for sets whose sizes differ greatly...
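Roughly what I have in mind for this strategy (a minimal sketch with redis-py; the key names are just illustrative):

```python
# Sketch of strategy 1: inclusion-exclusion over PFCOUNT.
# Key names are illustrative.
import redis

r = redis.Redis()

def intersection_estimate(key_a: str, key_b: str) -> int:
    """Estimate |A ∩ B| as |A| + |B| - |A ∪ B| from two HLL keys."""
    a = r.pfcount(key_a)
    b = r.pfcount(key_b)
    # PFCOUNT over several keys returns the union cardinality directly,
    # so a PFMERGE into a temporary key isn't strictly needed here.
    union = r.pfcount(key_a, key_b)
    # The subtraction is where the error blows up when the true
    # intersection is tiny relative to either set.
    return max(a + b - union, 0)
```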
2. Jaccard Index / MinHash Intersection Index
This method seems more interesting to me, but part of me feels it may overlap computationally with some of Redis's existing optimizations (that is, I'm not implementing my own HLL algorithm from scratch).
With this approach, I'd use a random sampling of bins with a MinHash algorithm (I don't think a full LSH implementation is worth the trouble). This would be a separate structure, but by using MinHash to get the Jaccard index of the sets, you can then effectively multiply the union cardinality by that index to get a more accurate intersection count.
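To illustrate the idea (a rough sketch of the MinHash side only; the signature size K and the salted-hash scheme are placeholders, the union cardinality would come from PFCOUNT/PFMERGE as above, and a production version over billions of items would want a single-pass bottom-k or one-permutation variant rather than K hashes per element):

```python
# Rough sketch of strategy 2: MinHash signature -> Jaccard -> intersection.
import hashlib

K = 128  # signature size; Jaccard standard error is roughly 1/sqrt(K)

def _h(value: str, seed: int) -> int:
    # Salted hash stands in for K independent hash functions.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(value.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")

def signature(items) -> list:
    """Keep the minimum hash per seed over one pass of the items."""
    sig = [float("inf")] * K
    for item in items:
        for seed in range(K):
            sig[seed] = min(sig[seed], _h(item, seed))
    return sig

def jaccard(sig_a, sig_b) -> float:
    """Fraction of slots where the two signatures collide."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / K

def intersection_estimate(sig_a, sig_b, union_cardinality: int) -> float:
    """|A ∩ B| ≈ J(A, B) * |A ∪ B|."""
    return jaccard(sig_a, sig_b) * union_cardinality
```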
The problem is that I'm not particularly well-versed in HLL, and while I'd love to dig into the Google paper, I need a workable implementation in short order. Chances are I'm overlooking some basic considerations, either in Redis's existing optimizations or in the algorithm itself, that would allow for computationally cheap intersection estimates with fairly loose confidence bounds.
So my question is:
What is the most efficient way to get a computationally cheap intersection estimate of N huge (billions of members) sets using Redis, if I'm willing to sacrifice space (and, to a small degree, accuracy)?