The task is simple: I need to find the right strategy for implementing accurate HyperLogLog unions based on their Redis representation - this includes handling their sparse/dense representations if the data structure is exported for use elsewhere.
Two strategies
There are two strategies, one of which seems a lot easier. I've looked at the actual Redis source, and I'm having some trouble (I'm not very strong in C) figuring out whether it's better to use its built-in structures/routines or to develop my own, in terms of accuracy and efficiency. For what it's worth, I'm willing to sacrifice space and, to some extent, error (standard error of ±2%) in the pursuit of efficiency with extremely large sets.
1. The principle of inclusion
The simpler of the two - essentially, I would just use the lossless union (PFMERGE) in combination with this principle to compute an estimate of the overlap. Tests show this works reliably in many cases, but I'm having trouble getting a good handle on its in-the-wild efficiency and accuracy (some cases can produce 20-40% error, which is unacceptable in this use case).
Basically:
intersectionCardinality = aCardinality + bCardinality - unionCardinality
or, in the case of three sets:
intersectionCardinality = aCardinality + bCardinality + cCardinality - abUnionCardinality - acUnionCardinality - bcUnionCardinality + abcUnionCardinality
(and, for more sets, the full inclusion-exclusion expansion)
seems to work with good accuracy in many cases, but I don't know if I trust it. While Redis has a number of built-in low-cardinality correction modifiers designed to work around known HLL problems, I don't know whether the problem of wild inaccuracies (using inclusion/exclusion) remains for sets whose sizes differ greatly...
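Roughly what I have in mind for this strategy (a minimal sketch with redis-py; the key names are just illustrative):

```python
# Sketch of strategy 1: inclusion-exclusion over PFCOUNT.
# Key names are illustrative.
import redis

r = redis.Redis()

def intersection_estimate(key_a: str, key_b: str) -> int:
    """Estimate |A ∩ B| as |A| + |B| - |A ∪ B| from two HLL keys."""
    a = r.pfcount(key_a)
    b = r.pfcount(key_b)
    # PFCOUNT over several keys returns the union cardinality directly,
    # so a PFMERGE into a temporary key isn't strictly needed here.
    union = r.pfcount(key_a, key_b)
    # The subtraction is where the error blows up when the true
    # intersection is tiny relative to either set.
    return max(a + b - union, 0)
```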
2. Jaccard Index / MinHash Intersection Index
This method seems more interesting to me, but part of me feels it may overlap computationally with some of Redis's existing optimizations (that is, I'm not implementing my own HLL algorithm from scratch).
With this approach, I'd use a random sampling of bins with a MinHash algorithm (I don't think a full LSH implementation is worth the trouble). This would be a separate structure, but by using MinHash to get the Jaccard index of the sets, you can then effectively multiply the union cardinality by that index to get a more accurate intersection count.
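To illustrate the idea (a rough sketch of the MinHash side only; the signature size K and the salted-hash scheme are placeholders, the union cardinality would come from PFCOUNT/PFMERGE as above, and a production version over billions of items would want a single-pass bottom-k or one-permutation variant rather than K hashes per element):

```python
# Rough sketch of strategy 2: MinHash signature -> Jaccard -> intersection.
import hashlib

K = 128  # signature size; Jaccard standard error is roughly 1/sqrt(K)

def _h(value: str, seed: int) -> int:
    # Salted hash stands in for K independent hash functions.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(value.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")

def signature(items) -> list:
    """Keep the minimum hash per seed over one pass of the items."""
    sig = [float("inf")] * K
    for item in items:
        for seed in range(K):
            sig[seed] = min(sig[seed], _h(item, seed))
    return sig

def jaccard(sig_a, sig_b) -> float:
    """Fraction of slots where the two signatures collide."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / K

def intersection_estimate(sig_a, sig_b, union_cardinality: int) -> float:
    """|A ∩ B| ≈ J(A, B) * |A ∪ B|."""
    return jaccard(sig_a, sig_b) * union_cardinality
```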
The problem is that I'm not particularly well-versed in HLL, and while I'd love to dig into the Google paper, I need a workable implementation in short order. Chances are I'm overlooking some basic considerations, either in Redis's existing optimizations or in the algorithm itself, that would allow for computationally cheap intersection estimates with fairly loose confidence bounds.
So my question is:
What is the most efficient way to get a computationally cheap intersection estimate of N huge (billions of members) sets using Redis, if I'm willing to sacrifice space (and, to a small degree, accuracy)?