There are various approaches under two main categories, each typically with its own advantages and disadvantages in terms of effectiveness and performance. It is probably best to choose the simplest algorithm for any given application and only use the more complex variants if necessary for your situation.
Note that these examples use EqualityComparer<T>.Default since it deals cleanly with null elements. You could do better than zero for null if you wanted; if T is constrained to struct, it is also unnecessary. You can also hoist the EqualityComparer<T>.Default lookup out of the function if desired.
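As a minimal sketch of that hoisting, assuming a generic static holder class (the class name here is hypothetical, not part of the original examples):

using System.Collections.Generic;

public static class OrderIndependentHash<T>
{
    // Cached once per closed generic type, rather than being looked up
    // on every call to the hash function.
    private static readonly EqualityComparer<T> Comparer =
        EqualityComparer<T>.Default;
}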
Commutative Operations
If the operations you use to combine the hash codes of the individual entries are commutative, you will get the same end result regardless of order.
There are several obvious options when combining numbers:
XOR
public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source)
    {
        hash = hash ^ EqualityComparer<T>.Default.GetHashCode(element);
    }
    return hash;
}
The downside is that the hash for {"x", "x"} is the same as the hash for {"y", "y"}, because any value XORed with itself cancels to zero. If that is not a problem for your situation, this is probably the simplest solution.
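To make the cancellation concrete (a quick illustration, not from the original examples):

int h1 = GetOrderIndependentHashCode(new[] { "x", "x" }); // 0, since hx ^ hx == 0
int h2 = GetOrderIndependentHashCode(new[] { "y", "y" }); // 0, since hy ^ hy == 0
// h1 == h2 even though the collections share no elements.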
Addition
public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source)
    {
        hash = unchecked(hash + EqualityComparer<T>.Default.GetHashCode(element));
    }
    return hash;
}
Overflow is fine here, hence the explicit unchecked context.
There are still some nasty cases (for example, {1, -1} and {2, -2}), but it is more likely to be fine, particularly with strings. For lists that may contain such integers, you could always implement a custom hash function, perhaps one that takes the recurrence index of a given value as a parameter and returns a unique hash code accordingly.
Here is an example of such an algorithm that tackles the above problem fairly efficiently. It also has the benefit of greatly improving the distribution of the generated hash codes (see the article linked at the end for some explanation). A mathematical/statistical analysis of exactly how this algorithm produces "better" hash codes would be quite advanced, but testing it across a wide range of input values and plotting the results should verify it well enough.
public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    int curHash;
    int bitOffset = 0;
    // Stores the number of occurrences seen so far for each value.
    var valueCounts = new Dictionary<T, int>();

    foreach (T element in source)
    {
        curHash = EqualityComparer<T>.Default.GetHashCode(element);
        if (valueCounts.TryGetValue(element, out bitOffset))
            valueCounts[element] = bitOffset + 1;
        else
            valueCounts.Add(element, bitOffset);

        // The current hash code is shifted (with wrapping) further left
        // on successive recurrences of a value, to widen the distribution.
        // 37 is an arbitrary low prime that reduces collisions, found
        // through simple experimentation.
        hash = unchecked(hash + ((curHash << bitOffset) |
            (curHash >> (32 - bitOffset))) * 37);
    }

    return hash;
}
Multiplication
This has few advantages over addition: small numbers and a mix of positive and negative numbers may lead to a better distribution of the hash bits. On the downside, "1" becomes a useless entry that contributes nothing, and any zero element zeroes out the entire result. You can special-case zero, as the code below does, to avoid this major flaw.
public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 17;
    foreach (T element in source)
    {
        int h = EqualityComparer<T>.Default.GetHashCode(element);
        if (h != 0)
            hash = unchecked(hash * h);
    }
    return hash;
}
Order first
The other core approach is to enforce some ordering first and then use any hash combination function you like. The ordering itself is immaterial as long as it is consistent.
public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source.OrderBy(x => x, Comparer<T>.Default))
    {
        // f is any hash combination function; one possibility is
        // sketched below.
        hash = f(hash, EqualityComparer<T>.Default.GetHashCode(element));
    }
    return hash;
}
This has some significant advantages in that the combining operations possible in f can have much better hashing properties (for example, bit distribution), but it comes at a significantly higher cost. The sort is O(n log n), and the required copy of the collection is a memory allocation you cannot avoid, given the need not to modify the original; GetHashCode implementations should normally avoid allocations altogether. One possible implementation of f would be similar to the one given in the last example under the Addition section: for example, a constant number of bit shifts to the left followed by a multiplication by a prime. You could even use successive primes on each iteration at no extra cost, since they only need to be generated once.
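As a minimal sketch of such an f, assuming the shift-then-multiply scheme just described (the shift count and prime here are illustrative, not canonical):

// One possible combining function: shift the running hash left by a
// constant, mix in the element hash, then multiply by a small prime.
private static int f(int hash, int elementHash)
{
    unchecked
    {
        return ((hash << 5) ^ elementHash) * 37;
    }
}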
That said, if you are dealing with cases where you can calculate the hash once, cache it, and amortize the cost over many GetHashCode calls, this approach may yield superior behaviour. It is also more flexible, since it can avoid the need to call GetHashCode on the elements if it knows their type, and instead use per-byte operations on them to yield an even better hash distribution. Such an approach would likely be of use only where performance has been identified as a significant bottleneck.
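As a rough sketch of the caching idea, assuming an immutable wrapper type (the class is hypothetical, and GetOrderIndependentHashCode stands for any of the functions above):

using System.Collections.Generic;
using System.Linq;

public sealed class CachedHashCollection<T>
{
    private readonly List<T> items;
    private readonly int cachedHash;

    public CachedHashCollection(IEnumerable<T> source)
    {
        // Copy the source so later changes cannot invalidate the hash.
        items = source.ToList();
        // Computed once here; amortized over all later GetHashCode calls.
        cachedHash = GetOrderIndependentHashCode(items);
    }

    public override int GetHashCode() => cachedHash;
}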
Finally, if you want a reasonably comprehensive and fairly non-mathematical overview of the subject of hash codes and their effectiveness in general, these blog posts are worth reading, in particular the "Implementing a simple hashing algorithm (pt II)" post.