Retrieving a hash of a list of strings regardless of order

I would like to write a GetHashCodeOfList() function that returns the hash code of a list of strings regardless of order. Two lists containing the same strings, in any order, should return the same hash code.

    ArrayList list1 = new ArrayList();
    list1.Add("String1");
    list1.Add("String2");
    list1.Add("String3");

    ArrayList list2 = new ArrayList();
    list2.Add("String3");
    list2.Add("String2");
    list2.Add("String1");

    // This should be true.
    bool same = GetHashCodeOfList(list1) == GetHashCodeOfList(list2);

I had a few thoughts:

  • I can sort the list first, combine the sorted list into one long string, and then call GetHashCode() on it. However, sorting is a slow operation.

  • I can get the hash of each individual line in the list (by calling string.GetHashCode()), then multiply all the hashes together and take the result Mod UInt32.MaxValue. For example: "String1".GetHashCode() * "String2".GetHashCode() * … MOD UInt32.MaxValue. But this leads to numeric overflow.

Does anyone have any thoughts?

Thanks in advance for your help.

+61
string c# hash
Mar 21 '09 at 21:48
5 answers

There are various approaches here, falling into two main categories, each typically with its own benefits and disadvantages in terms of effectiveness and performance. It is probably best to choose the simplest algorithm that works for your application and only move to the more complex variants if your situation requires it.

Note that these examples use EqualityComparer<T>.Default since it deals cleanly with null elements. You could do better than zero for null if desired. If T is constrained to struct, the null handling is unnecessary. You can also hoist the EqualityComparer<T>.Default lookup out of the function if desired.
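
As a rough illustration of hoisting the comparer lookup, the comparer can be resolved once before the loop, or passed in by the caller. The class name HashSketch and the optional comparer parameter are assumptions made for this sketch; the XOR step is only a placeholder for any of the commutative operations.

```csharp
using System.Collections.Generic;

public static class HashSketch
{
    // Hypothetical variant: the comparer is resolved once, outside the loop.
    // Passing null for 'comparer' falls back to EqualityComparer<T>.Default.
    public static int GetOrderIndependentHashCode<T>(
        IEnumerable<T> source, IEqualityComparer<T> comparer = null)
    {
        comparer = comparer ?? EqualityComparer<T>.Default;
        int hash = 0;
        foreach (T element in source)
        {
            // The default comparer returns 0 for a null element; a custom
            // comparer passed in here has to handle null itself.
            hash ^= comparer.GetHashCode(element);
        }
        return hash;
    }
}
```

A caller could pass, for example, StringComparer.OrdinalIgnoreCase to make the hash ignore case as well as order.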

Commutative Operations

If you combine the hash codes of the individual entries using an operation that is commutative, you get the same end result regardless of order.

There are several obvious options on numbers:

XOR

    public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
    {
        int hash = 0;
        foreach (T element in source)
        {
            hash = hash ^ EqualityComparer<T>.Default.GetHashCode(element);
        }
        return hash;
    }

One downside is that the hash for {"x", "x"} is the same as the hash for {"y", "y"}. If that is not a problem for your situation, this is probably the simplest solution.
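
The collision mentioned above follows from the fact that any value XORed with itself is 0. A small sketch to check it (the class and method names are made up for this illustration):

```csharp
using System.Collections.Generic;

public static class XorCollisionDemo
{
    // Same XOR combination as in the answer, repeated here so the
    // sketch is self-contained.
    public static int XorHash<T>(IEnumerable<T> source)
    {
        int hash = 0;
        foreach (T element in source)
        {
            hash ^= EqualityComparer<T>.Default.GetHashCode(element);
        }
        return hash;
    }

    // h ^ h == 0 for every h, so a list consisting of one value repeated
    // an even number of times always hashes to 0, whatever the value is.
    public static bool PairsCollide()
    {
        int xx = XorHash(new[] { "x", "x" });
        int yy = XorHash(new[] { "y", "y" });
        return xx == 0 && yy == 0;
    }
}
```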

Addition

    public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
    {
        int hash = 0;
        foreach (T element in source)
        {
            hash = unchecked(hash + EqualityComparer<T>.Default.GetHashCode(element));
        }
        return hash;
    }

Overflow is fine here, hence the explicit unchecked context.
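
To make the difference concrete, here is a small sketch contrasting the two contexts; the class and method names are illustrative only. In an unchecked context the sum wraps around, while in a checked context the same addition throws an OverflowException.

```csharp
using System;

public static class OverflowDemo
{
    // Adds two ints, silently wrapping around on overflow.
    public static int AddWrapping(int a, int b)
    {
        return unchecked(a + b);
    }

    // The same addition in a checked context. Returns false (and sets
    // sum to 0) when the addition overflows.
    public static bool TryAddChecked(int a, int b, out int sum)
    {
        try
        {
            sum = checked(a + b);
            return true;
        }
        catch (OverflowException)
        {
            sum = 0;
            return false;
        }
    }
}
```

Wrapping is harmless for a hash code because addition stays commutative modulo 2^32, so the result still does not depend on element order.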

There are still some nasty cases (for example, {1, -1} and {2, -2}), but it is more likely to be okay, particularly with strings. In the case of lists that may contain such integers, you could always implement a custom hashing function (perhaps one that takes the index of recurrence of the specific value as a parameter and returns a unique hash code accordingly).

Here is an example of such an algorithm that gets around the problem above fairly efficiently. It also has the benefit of greatly widening the distribution of the generated hash codes (see the article at the end for some explanation). A mathematical/statistical analysis of exactly how this algorithm produces "better" hash codes would be quite advanced, but testing it across a wide range of input values and plotting the results should verify it well enough.

    public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
    {
        int hash = 0;
        // Stores the number of occurrences so far of each value.
        // Note: Dictionary<T, int> rejects null keys, so this version
        // throws on null elements.
        var valueCounts = new Dictionary<T, int>();

        foreach (T element in source)
        {
            int curHash = EqualityComparer<T>.Default.GetHashCode(element);

            // bitOffset is 0 for the first occurrence of a value,
            // 1 for the second, and so on.
            int bitOffset;
            valueCounts.TryGetValue(element, out bitOffset);
            valueCounts[element] = bitOffset + 1;

            // The current hash code is shifted (with wrapping) one bit
            // further left on each successive recurrence of a certain
            // value to widen the distribution. The rotation is done on
            // uint so that the right shift does not sign-extend.
            // 37 is an arbitrary low prime number that helps the
            // algorithm to smooth out the distribution.
            uint bits = unchecked((uint)curHash);
            int rotated = unchecked((int)((bits << bitOffset) | (bits >> (32 - bitOffset))));
            hash = unchecked(hash + rotated * 37);
        }

        return hash;
    }

Multiplication

Multiplication has few if any benefits over addition: small numbers and a mix of positive and negative numbers may lead to a better distribution of hash bits. On the negative side, a hash code of 1 becomes a useless entry that contributes nothing, and any element whose hash code is zero turns the whole result into zero. You can special-case zero to avoid this major flaw.

    public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
    {
        int hash = 17;
        foreach (T element in source)
        {
            int h = EqualityComparer<T>.Default.GetHashCode(element);
            if (h != 0)
                hash = unchecked(hash * h);
        }
        return hash;
    }
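
The two weaknesses of multiplication (a zero wiping out the product, and a 1 contributing nothing) are easy to check with integers, because Int32.GetHashCode() returns the value itself. The sketch below uses made-up names and a seed of 17 as in the answer; it is only meant to illustrate the behaviour.

```csharp
using System.Collections.Generic;

public static class MultiplyDemo
{
    // Naive product with no special case: any zero hash code makes
    // the whole result zero.
    public static int NaiveProduct(IEnumerable<int> source)
    {
        int hash = 17;
        foreach (int element in source)
        {
            hash = unchecked(hash * element.GetHashCode());
        }
        return hash;
    }

    // Product that skips zero hash codes, as in the answer. A hash code
    // of 1 still contributes nothing, and neither does a skipped zero.
    public static int ProductSkippingZero(IEnumerable<int> source)
    {
        int hash = 17;
        foreach (int element in source)
        {
            int h = element.GetHashCode();
            if (h != 0)
                hash = unchecked(hash * h);
        }
        return hash;
    }
}
```

With these definitions, NaiveProduct gives 0 for both {0, 5} and {0, 7}, while ProductSkippingZero gives 17 * 5 = 85 for {5}, {1, 5} and {0, 5} alike.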

Order first

The other basic approach is to enforce some ordering first, then use any hash combination function you like. The ordering itself does not matter as long as it is consistent.

    // OrderBy requires: using System.Linq;
    public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
    {
        int hash = 0;
        foreach (T element in source.OrderBy(x => x, Comparer<T>.Default))
        {
            // f is any function/code you like returning int
            hash = f(hash, element);
        }
        return hash;
    }

This has some significant benefits in that the combining operations possible in f can have significantly better hashing properties (distribution of bits, for example), but this comes at a significantly higher cost. The sort is O(n log n), and the required copy of the collection is a memory allocation you cannot avoid if you do not want to modify the original. GetHashCode implementations should normally avoid allocations entirely. One possible implementation of f would be similar to the one given in the last example under the Addition section (for example, any constant number of bit shifts left followed by a multiplication by a prime; you could even use successive primes on each iteration at no extra cost, since they only need to be generated once).
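
As a minimal sketch of what f might look like under that description, the version below shifts the running hash left, adds the element hash, and multiplies by a prime. The shift of 5, the prime 31, the seed 17 and the names OrderedHashSketch and Combine are all arbitrary choices made for this illustration, not part of the original answer.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class OrderedHashSketch
{
    // Hypothetical combining step ("f"): shift left, add the element's
    // hash code, then multiply by a small prime. Not commutative, which
    // is fine because the elements are sorted first.
    private static int Combine<T>(int hash, T element)
    {
        int h = EqualityComparer<T>.Default.GetHashCode(element);
        return unchecked(((hash << 5) + h) * 31);
    }

    public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
    {
        int hash = 17;
        // OrderBy sorts a copy; 'source' itself is not modified.
        foreach (T element in source.OrderBy(x => x, Comparer<T>.Default))
        {
            hash = Combine(hash, element);
        }
        return hash;
    }
}
```

Comparer<T>.Default needs T to be comparable, and for strings it compares in a culture-sensitive way; passing StringComparer.Ordinal to OrderBy instead may be the safer choice when the elements are strings.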

That said, if you are dealing with cases where you can calculate and cache the hash, amortizing the cost over many calls to GetHashCode, this approach may yield superior behaviour. The latter approach is also more flexible, since it can avoid calling GetHashCode on the elements if it knows their type and instead use per-byte operations on them to yield an even better hash distribution. Such an approach is likely to be useful only when performance has been identified as a significant bottleneck.
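
One way the caching idea could look is sketched below: a wrapper that snapshots the items and computes the sorted hash at most once. The type name HashedBag and the constants 17 and 31 are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical wrapper: copies the items once and computes the
// order-independent hash lazily, on the first GetHashCode call.
public sealed class HashedBag<T>
{
    private readonly T[] items;
    private readonly Lazy<int> hash;

    public HashedBag(IEnumerable<T> source)
    {
        // Private copy, so later changes to 'source' cannot make the
        // cached hash stale.
        items = source.ToArray();
        hash = new Lazy<int>(ComputeHash);
    }

    private int ComputeHash()
    {
        int result = 17;
        foreach (T element in items.OrderBy(x => x, Comparer<T>.Default))
        {
            result = unchecked(result * 31 + EqualityComparer<T>.Default.GetHashCode(element));
        }
        return result;
    }

    // The sort and allocation are paid once; later calls return the
    // cached value.
    public override int GetHashCode()
    {
        return hash.Value;
    }
}
```

To use such a type as a dictionary key, a matching Equals override (comparing the items regardless of order) would also be needed; it is left out here to keep the sketch short.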

Finally, if you want a reasonably comprehensive and fairly non-mathematical overview of the subject of hash codes and their effectiveness in general, these blog posts are worth reading, in particular the post "Implementing a Simple Hashing Algorithm (pt II)".

+72
Mar 21 '09 at 21:52

An alternative to sorting the list of strings would be to get the hash codes of the strings and then sort the hash codes. (Comparing ints is less expensive than comparing strings.) You can then use an algorithm to combine the hash codes that (hopefully) gives a better distribution.

Example:

    public static int GetHashCodeOfList<T>(IEnumerable<T> list)
    {
        List<int> codes = new List<int>();
        foreach (T item in list)
        {
            codes.Add(item.GetHashCode());
        }
        codes.Sort();

        int hash = 0;
        foreach (int code in codes)
        {
            unchecked
            {
                hash *= 251; // multiply by a prime number
                hash += code; // add next hash code
            }
        }
        return hash;
    }
+21
Mar 21 '09 at 23:20
    Dim list1 As ArrayList = New ArrayList()
    list1.Add("0")
    list1.Add("String1")
    list1.Add("String2")
    list1.Add("String3")
    list1.Add("abcdefghijklmnopqrstuvwxyz")

    Dim list2 As ArrayList = New ArrayList()
    list2.Add("0")
    list2.Add("String3")
    list2.Add("abcdefghijklmnopqrstuvwxyz")
    list2.Add("String2")
    list2.Add("String1")

    If GetHashCodeOfList(list1) = GetHashCodeOfList(list2) Then
        Stop
    Else
        Stop
    End If

    For x As Integer = list1.Count - 1 To 0 Step -1
        list1.RemoveAt(list1.Count - 1)
        list2.RemoveAt(list2.Count - 1)
        Debug.WriteLine(GetHashCodeOfList(list1).ToString)
        Debug.WriteLine(GetHashCodeOfList(list2).ToString)
        If list1.Count = 2 Then Stop
    Next

    Private Function GetHashCodeOfList(ByVal aList As ArrayList) As UInt32
        Const mask As UInt16 = 32767, hashPrime As Integer = Integer.MaxValue
        Dim retval As UInt32
        Dim ch() As Char = New Char() {}
        For idx As Integer = 0 To aList.Count - 1
            ch = DirectCast(aList(idx), String).ToCharArray
            For idCH As Integer = 0 To ch.Length - 1
                retval = (retval And mask) + (Convert.ToUInt16(ch(idCH)) And mask)
            Next
        Next
        If retval > 0 Then retval = Convert.ToUInt32(hashPrime \ retval) 'Else ????
        Return retval
    End Function
0
Mar 22 '09 at 13:50

Much less code, but perhaps the performance is not as good as the other answers:

    public static int GetOrderIndependentHashCode<T>(this IEnumerable<T> source)
        => source == null
            ? 0
            : HashSet<T>.CreateSetComparer().GetHashCode(new HashSet<T>(source));
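
One thing to keep in mind with this approach is that a HashSet<T> drops duplicates, so lists that differ only in how often an element occurs get the same hash. A small sketch to check this (SetHashDemo and SetHash are names made up for the illustration):

```csharp
using System.Collections.Generic;

public static class SetHashDemo
{
    // Same idea as the one-liner above, written as a plain static
    // method so the sketch is self-contained.
    public static int SetHash<T>(IEnumerable<T> source)
    {
        if (source == null)
            return 0;

        // Building the set removes duplicate elements before hashing.
        var set = new HashSet<T>(source);
        return HashSet<T>.CreateSetComparer().GetHashCode(set);
    }
}
```

For example, SetHash(new[] { "a", "b", "a" }) and SetHash(new[] { "b", "a" }) return the same value, which may or may not be what you want for lists.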
0
Feb 19 '19 at 20:39

Here is a hybrid approach. It combines three commutative operations (XOR, addition and multiplication), applying each to a different bit range of the 32-bit hash code. The number of bits given to each operation is adjustable.

    public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
    {
        var comparer = EqualityComparer<T>.Default;
        const int XOR_BITS = 10;
        const int ADD_BITS = 11;
        const int MUL_BITS = 11;
        Debug.Assert(XOR_BITS + ADD_BITS + MUL_BITS == 32);
        int xor_total = 0;
        int add_total = 0;
        int mul_total = 17;
        unchecked
        {
            foreach (T element in source)
            {
                var hashcode = comparer.GetHashCode(element);
                int xor_part = hashcode >> (32 - XOR_BITS);
                int add_part = hashcode << XOR_BITS >> (32 - ADD_BITS);
                int mul_part = hashcode << (32 - MUL_BITS) >> (32 - MUL_BITS);
                xor_total = xor_total ^ xor_part;
                add_total = add_total + add_part;
                if (mul_part != 0) mul_total = mul_total * mul_part;
            }
            xor_total = xor_total % (1 << XOR_BITS); // Compact
            add_total = add_total % (1 << ADD_BITS); // Compact
            mul_total = mul_total - 17; // Subtract initial value
            mul_total = mul_total % (1 << MUL_BITS); // Compact
            int result = (xor_total << (32 - XOR_BITS)) + (add_total << XOR_BITS) + mul_total;
            return result;
        }
    }

Performance is almost identical to the simple XOR method, because the GetHashCode call of each element dominates the CPU load.

0
Apr 18 '19 at 17:50


