Is there a way to do a probabilistic constant-time equality check for collection types?

Problem

I have been wondering how to do an efficient equality comparison for collection types (lists, sets, maps, etc.). Note that structural equality is what is desired, rather than reference equality.

Usually you have to iterate over all the elements of the collection and compare them pairwise at a cost of O(1) per comparison, which gives a total comparison time of O(n).

This can hurt in scenarios such as a hash table keyed by lists, where collision checks are quite expensive, or in design-by-contract code (for example, comparing an old collection with a new one).

Direction of a possible solution

I can think of a few quick approaches, but they all seem heuristic / non-deterministic. The idea is to maintain some hash over all the elements that can be stored and compared. A good hashing algorithm should provide enough entropy that the chance of a collision is small.

This hash-based comparison could be strengthened with some constant-time spot check on the list (for example, comparing the first 10 elements). Two lists with matching hashes from a good hashing algorithm and the same elements at the beginning should, in theory, give a reasonably reliable "probably equal" verdict.
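A rough sketch of what I have in mind (the helper name, the cached-hash parameters, and the cutoff of 10 are arbitrary, not from any library):

    using System;
    using System.Collections.Generic;

    static class CollectionPreCheck
    {
        // aHash/bHash are assumed to be hashes over all elements, maintained
        // incrementally by the collections themselves.
        public static bool ProbablyEqual<T>(IReadOnlyList<T> a, IReadOnlyList<T> b,
                                            int aHash, int bHash)
        {
            if (a.Count != b.Count) return false;   // O(1)
            if (aHash != bHash) return false;       // O(1), given precomputed hashes

            // Constant-bound spot check of the first few elements.
            int prefix = Math.Min(10, a.Count);
            var cmp = EqualityComparer<T>.Default;
            for (int i = 0; i < prefix; i++)
                if (!cmp.Equals(a[i], b[i])) return false;

            return true; // "probably equal"; a full O(n) check is still needed for certainty
        }
    }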

Question

Is it possible to build some kind of constant-time comparison (either generic, or specialized for certain types such as integers), and can it be achieved with the unique-hash approach described above?

Update

To clarify the question: I do not need a perfect equality check, just a fast "pre-equality" check as a way to speed up a real equality check. Although many hash code implementations are useful for this, I am also interested in (ordered) comparison.

+7
8 answers

I took a couple of minutes to write such a collection class in C#; the source is below. I used the generic System.Collections.ObjectModel.Collection<T> because its behavior is easy to override.

I have not tested it at all, but it should be a solid start IMHO. Note that UpdateHash takes indices into account (which makes the hash function a little better); a HashedSet<T> analog would skip that part.

Also, because the XOR operator is reversible, the hash update on add/remove stays O(1). If you need a better hash, those operations would grow to O(n), so I recommend profiling and then deciding what works best.

    using System;
    using System.Collections.ObjectModel;

    public class HashedList<T> : Collection<T>, IEquatable<HashedList<T>>
    {
        private int _hash;

        // Folds an element's index and hash code into the running hash.
        // Note: contributions are index-based, so inserting or removing anywhere
        // but the end shifts later indices without rehashing them here.
        private void UpdateHash(int index, T item)
        {
            _hash ^= index;
            if (item != null)
                _hash ^= item.GetHashCode();
        }

        #region Overridden collection methods

        protected override void InsertItem(int index, T item)
        {
            UpdateHash(index, item);
            base.InsertItem(index, item);
        }

        protected override void RemoveItem(int index)
        {
            // XOR is reversible, so XOR-ing the same contribution removes it.
            UpdateHash(index, this[index]);
            base.RemoveItem(index);
        }

        protected override void ClearItems()
        {
            _hash = 0;
            base.ClearItems();
        }

        protected override void SetItem(int index, T item)
        {
            UpdateHash(index, this[index]); // remove old item's contribution
            UpdateHash(index, item);        // add new item's contribution
            base.SetItem(index, item);
        }

        #endregion

        #region Value equality

        public bool Equals(HashedList<T> other)
        {
            if (other == null) return false;
            if (object.ReferenceEquals(this, other)) return true;
            if (other.Count != this.Count) return false;
            if (other._hash != this._hash) return false; // cheap probabilistic rejection
            return CompareElements(other);               // full check only if hashes match
        }

        private bool CompareElements(HashedList<T> other)
        {
            for (int i = 0; i < this.Count; i++)
            {
                if (this[i] == null)
                {
                    if (other[i] != null) return false;
                    continue; // both null, nothing more to compare at this index
                }
                if (this[i].Equals(other[i]) == false) return false;
            }
            return true;
        }

        public override bool Equals(object obj)
        {
            var hashed = obj as HashedList<T>;
            if (hashed != null) return Equals(hashed);
            return base.Equals(obj);
        }

        public override int GetHashCode()
        {
            return _hash;
        }

        #endregion
    }

You could also argue that object.Equals should return true when any IList<T> implementation with the same elements is passed in, but since their hash codes would then differ, that would break the consistency between Equals and GetHashCode that is recommended for object.Equals implementations, IIRC.
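A quick usage sketch, assuming the class above compiles as-is (e.g. inside a Main method):

    var a = new HashedList<int> { 1, 2, 3 };
    var b = new HashedList<int> { 1, 2, 3 };
    var c = new HashedList<int> { 3, 2, 1 };

    Console.WriteLine(a.Equals(b)); // True: counts and hashes match, then elements are compared
    Console.WriteLine(a.Equals(c)); // False: the simple XOR hashes happen to collide here,
                                    // so the element-by-element fallback makes the call
    b[1] = 42;
    Console.WriteLine(a.Equals(b)); // False: the hash check short-circuits the comparison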

+2

Use hash-based comparisons.

Hash(SetA) vs. Hash(SetB).

PS: you need to sort the elements of the sets (or establish any other deterministic order) before computing the hash. It is possible for the hashes to match while the collections do not (due to a hash collision), but the chances of that are pretty low.

PPS: I assume the collections are static (or nearly static). In that case you can pre-compute the hashes when the collection is created, so each comparison is O(1). Otherwise, as Groo said, use XOR-based hashing, which is pretty efficient.
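For example, a minimal sort-then-hash sketch (the combining constants and the method name are arbitrary):

    using System.Collections.Generic;

    static class SetHashing
    {
        // Bring the element hash codes into a deterministic order, then fold them
        // with an order-dependent combine. Computed once when the set is built,
        // this makes later comparisons O(1), up to hash collisions.
        public static int ContentHash<T>(IEnumerable<T> set)
        {
            var hashes = new List<int>();
            foreach (var item in set)
                hashes.Add(item == null ? 0 : item.GetHashCode());
            hashes.Sort();

            unchecked
            {
                int h = 17;
                foreach (int ih in hashes)
                    h = h * 31 + ih;
                return h;
            }
        }
    }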

Follow-up: using information theory, it can be shown that if each of X and Y can take 2^n unique values, you need at least O(n) work to compare them exactly; there is no way around that. What hashes give you is the ability to compare efficiently in the common case.

+2

You can use Bloom filters for this task [for sets]. Each set would also have a Bloom filter attached to it.

If the two filters are identical, the structures are probably identical.

If the two filters are not identical, the sets are definitely different from each other.

Upside:
There are no false negatives. If the filters are different, the structures are different.

Downside:
You may get false positives. You will need an additional check [a full traversal] to make sure the two structures are really identical.

Note that the false positive rate depends on the size of the Bloom filter: the larger it is, the fewer false positives you get.

Also note: since Bloom filters are just bit arrays, comparing two Bloom filters can be implemented very efficiently.
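For illustration, a minimal Bloom filter sketch along these lines (the bit count, the number of hash functions, and the double-hashing mix are arbitrary, untuned choices):

    using System.Collections;

    class BloomFilter
    {
        private readonly BitArray _bits;
        private readonly int _k;

        public BloomFilter(int sizeInBits = 1024, int hashCount = 4)
        {
            _bits = new BitArray(sizeInBits);
            _k = hashCount;
        }

        public void Add(object item)
        {
            // Derive k bit positions from two hash values (double hashing).
            int h1 = item == null ? 0 : item.GetHashCode();
            int h2 = unchecked(h1 * (int)0x9E3779B9);
            for (int i = 0; i < _k; i++)
            {
                int idx = (unchecked(h1 + i * h2) & 0x7FFFFFFF) % _bits.Length;
                _bits[idx] = true;
            }
        }

        // Equal filters => probably equal sets; different filters => definitely different.
        public bool SameBitsAs(BloomFilter other)
        {
            if (other._bits.Length != _bits.Length) return false;
            for (int i = 0; i < _bits.Length; i++)
                if (_bits[i] != other._bits[i]) return false;
            return true;
        }
    }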

+1

Here is a very useful (and detailed) discussion of the topic, including reference implementations for several types of collections.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2009/n2986.pdf

In the general case, checking whether one container is a permutation of the other is a quadratic operation. However, given two unordered containers that use the same hash and key-equivalence functions, the elements fall into key-equivalence groups, which makes the comparison much more efficient.
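As a rough illustration of the idea (a sketch only, not the paper's reference implementation): count occurrences per equivalence group on one side and consume them from the other.

    using System.Collections.Generic;

    static class UnorderedCompare
    {
        // Treats both sequences as multisets and compares them by per-key counts.
        // Expected O(n) with a reasonable hash; null elements are not handled here.
        public static bool MultisetEquals<T>(IEnumerable<T> left, IEnumerable<T> right)
        {
            var counts = new Dictionary<T, int>();
            int leftTotal = 0, rightTotal = 0;

            foreach (var item in left)
            {
                counts.TryGetValue(item, out int c);
                counts[item] = c + 1;
                leftTotal++;
            }

            foreach (var item in right)
            {
                if (!counts.TryGetValue(item, out int c) || c == 0) return false;
                counts[item] = c - 1;
                rightTotal++;
            }

            return leftTotal == rightTotal;
        }
    }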

+1

If you use a cryptographic (secure) hash function, the probability of a collision is vanishingly small (and with a modern hash function, you could write a paper if you ever ran into one :-)).

If your collection is implemented as a tree, you can maintain a hash computed from the leaves up to the root, at a cost of a constant factor times the cost of updating the tree, which you have to pay anyway. Unfortunately, the constant factor for computing a secure hash is probably quite large. Another catch is that the two collections with the same objects must end up with the same tree structure. This works with a radix tree ( http://en.wikipedia.org/wiki/Radix_tree ), but not with typical balanced trees, where the update history affects the tree structure.
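For illustration, a toy sketch of the leaves-to-root idea with an immutable binary node; GetHashCode and the combining constants stand in for the secure hash discussed above:

    // Each node's hash is computed once from its value and children, so rebuilding
    // the path from a changed leaf to the root recomputes only O(depth) hashes.
    sealed class HashedNode<T>
    {
        public T Value { get; }
        public HashedNode<T> Left { get; }
        public HashedNode<T> Right { get; }
        public int Hash { get; }

        public HashedNode(T value, HashedNode<T> left = null, HashedNode<T> right = null)
        {
            Value = value;
            Left = left;
            Right = right;
            unchecked
            {
                int h = value == null ? 0 : value.GetHashCode();
                if (left != null)  h = h * 31 + left.Hash;
                if (right != null) h = h * 397 + right.Hash;
                Hash = h;
            }
        }
    }

Equal root hashes mean the trees are probably equal; as noted above, this only helps when equal contents produce the same tree structure.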

Perfect hash functions are usually constructed for one particular collection, which probably will not work in your case. If the hash function maps into the numbers 1..N, then given N + 1 objects there will always be at least one collision (pigeonhole principle).

0

No, this is theoretically impossible. With 32-bit hash values you can only distinguish 2^32 cases, but lists can grow arbitrarily large. By the same argument, any procedure limited to at most k steps can only make a rough comparison.

If you do not need a 100% guarantee, you can of course use a hash function. However, I would not reinvent the wheel, which usually leads to worse results than using the standard libraries. For example, you may forget:

0

I would go with

    hash(structure) := hash(item1) ^ hash(item2) ^ ... ^ hash(item_n)

Depending on the hash function (and, above all, on its output size) this gives you a good (low) false positive probability. It produces no false negatives, and insertion and removal are easy to implement in constant time. It beats Bloom filters in that the false positive probability does not depend on the number of elements.

For arrays or lists: how likely is it that two arrays have the same content in a different order? If that matters, you can easily make the hash depend on the position as well:

    hash(structure) := hash(item1, 1) ^ hash(item2, 2) ^ ... ^ hash(item_n, n)

In this case, deletion and insertion are O(1) at the end of the array. Random inserts in the middle are more complicated, but then again, they are O(n) for arrays anyway.
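A small sketch of the position-dependent variant, including the O(1) hash update when appending at the end (the way the index is mixed in is arbitrary):

    using System.Collections.Generic;

    static class PositionalXorHash
    {
        // hash(item_i, i): mix the element's hash with its index so that
        // reorderings usually change the combined result.
        private static int ElementHash<T>(T item, int index)
        {
            int h = item == null ? 0 : item.GetHashCode();
            return unchecked(h * 31 + index);
        }

        public static int FullHash<T>(IReadOnlyList<T> list)
        {
            int hash = 0;
            for (int i = 0; i < list.Count; i++)
                hash ^= ElementHash(list[i], i);
            return hash;
        }

        // O(1): XOR in the contribution of an element appended at position index.
        public static int AfterAppend<T>(int currentHash, T item, int index)
        {
            return currentHash ^ ElementHash(item, index);
        }
    }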

0

My first thought after reading your question is: what do you mean by "probabilistic"? Do you think of probabilistic methods as a way to get an exact answer that is guaranteed correct (no errors)? Or are you willing to accept some error in the result?

In the latter case, you can use an approximate notion of "equivalence" when comparing the data. Consider Linear Counting:

  • Create a zero-initialized bitmap b of size m
  • Choose a hash function f
  • Apply f to each input element, obtaining a value v
  • Set the bit of b at position v to 1

To calculate the count, the formula is:

n = -m * ln(Un / m)

Where:

  • n → the approximate count
  • Un → the number of zero bits in the bitmap

For choosing an appropriate size m, see the original paper linked above. Also, see this blog post, which also covers HyperLogLog:

http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
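A minimal Linear Counting sketch following the steps above (the bitmap size is illustrative and GetHashCode stands in for the chosen hash function f):

    using System;
    using System.Collections;

    class LinearCounter
    {
        private readonly BitArray _bitmap;

        public LinearCounter(int m = 1 << 16)
        {
            _bitmap = new BitArray(m);   // zero-initialized bitmap b of size m
        }

        public void Add(object item)
        {
            int v = (item == null ? 0 : item.GetHashCode()) & 0x7FFFFFFF;
            _bitmap[v % _bitmap.Length] = true;   // set the bit at position v
        }

        // n = -m * ln(Un / m), where Un is the number of zero bits.
        public double EstimateCount()
        {
            int zeroBits = 0;
            for (int i = 0; i < _bitmap.Length; i++)
                if (!_bitmap[i]) zeroBits++;
            return -_bitmap.Length * Math.Log((double)zeroBits / _bitmap.Length);
        }
    }

Comparing the two bitmaps built with the same f and m gives the same kind of probabilistic verdict as the other hash-based answers: different bitmaps prove the inputs differ, matching bitmaps only suggest equality.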

0
