Algorithm for matching lists of integers

For each day, we have approximately 50,000 instances of a data structure (this can grow much larger over time), each containing the following:

DateTime AsOfDate; int key; List<int> values; // list of distinct integers 

It probably doesn't matter, but the values list has the property that, for a given AsOfDate, combining the values across all keys produces a list of distinct integers. That is, no integer appears in two different values lists on the same day.

Lists usually contain very few elements (one to five), but occasionally up to 50.

Given two adjacent days, we want to find pairs of these objects whose keys differ between the two days but whose values lists contain the same integers.

We use the following algorithm. Convert the list of values to a string with

 string signature = String.Join("|", values.OrderBy(n => n).ToArray()); 

then hash the signature to an integer, sort the resulting lists of hash codes (one list per day), walk the two lists looking for matches, and then check whether the associated keys differ. (We also compare the underlying lists to make sure it was not a hash collision.)
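As a minimal, self-contained sketch of the approach described above (the `Record` type and member names are illustrative, not from the original code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Record
{
    public DateTime AsOfDate;
    public int Key;
    public List<int> Values;
}

static class SignatureMatcher
{
    // Build the "v1|v2|..." signature from the sorted values, as described above.
    public static string Signature(List<int> values) =>
        string.Join("|", values.OrderBy(n => n));

    // Returns (day1 record, day2 record) pairs whose values match
    // but whose keys differ.
    public static IEnumerable<(Record, Record)> FindMatches(
        IEnumerable<Record> day1, IEnumerable<Record> day2)
    {
        // Signatures are unique within a day, because no integer appears
        // in two different values lists on the same day.
        var bySignature = day1.ToDictionary(r => Signature(r.Values));
        foreach (var r2 in day2)
        {
            if (bySignature.TryGetValue(Signature(r2.Values), out var r1)
                && r1.Key != r2.Key)
                yield return (r1, r2);
        }
    }
}
```

Using a dictionary keyed on the signature avoids sorting the per-day hash lists and walking them in lockstep; the lookup per day-2 record is amortized O(1).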

Is there a better way?

+4
6 answers

You can probably just hash the list itself, rather than going through a string.

Also, I think your algorithm is close to optimal. Assuming no hash collisions, it is O(n log n + m log m), where n and m are the number of entries for each of the two days being compared. (The sorting is the bottleneck.)

You can do it in O(n + m) if you use a bucket array (essentially a hash table) into which you insert the hashes. You can compare two bucket arrays in O(max(n, m)), assuming their length scales with the number of records (to keep a reasonable load factor).

It should be possible to have the library do this for you (it looks like you are using .NET) by using HashSet<T>.IntersectWith() and writing a suitable equality comparer.
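A sketch of that suggestion, assuming each values list is kept sorted (the comparer and class names here are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// An equality comparer over sorted lists, so a HashSet<List<int>>
// treats lists with equal content as the same element.
class SortedListComparer : IEqualityComparer<List<int>>
{
    public bool Equals(List<int> a, List<int> b) => a.SequenceEqual(b);

    public int GetHashCode(List<int> list)
    {
        int h = 17;
        foreach (int v in list)
            h = h * 31 + v;   // order-sensitive, fine for sorted lists
        return h;
    }
}

static class DayMatcher
{
    // Returns the value lists that appear on both days.
    // Assumes each list is already sorted.
    public static HashSet<List<int>> CommonLists(
        IEnumerable<List<int>> day1, IEnumerable<List<int>> day2)
    {
        var common = new HashSet<List<int>>(day1, new SortedListComparer());
        common.IntersectWith(day2);   // uses the set's comparer
        return common;
    }
}
```

IntersectWith uses the comparer the set was constructed with, so the day-2 lists never need to be hashed into a second set yourself. The surviving lists still need their keys checked on both days, as in the question.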

You cannot do better than O(n + m), because you need to visit every record at least once.

Edit: wrong, fixed.

+5

In addition to the other answers, you can speed things up by building a cheap hash from the XOR of all the elements of each list. You will not need to order the list, and all you get is an int, which is easier and faster to store than a string.

Then you just use the XORed number as the key into a Hashtable and check for the presence of the key before inserting. Only if the key already exists do you sort the relevant lists and compare them.

You still need to compare them when you find a match, because plain XOR can produce collisions.
I think the result will be much faster and use much less memory than sorting the arrays and converting them to strings.

If you have your own implementation of List<>, you could build the XOR key generation into it, so that the key is recalculated on every list operation.
That would speed up checking for duplicate lists.

code

The following is the first attempt to implement this.

 Dictionary<int, List<List<int>>> checkHash = new Dictionary<int, List<List<int>>>();

 public bool CheckDuplicate(List<int> theList)
 {
     bool isIdentical = false;
     int xorkey = 0;
     foreach (int v in theList)
         xorkey ^= v;

     List<List<int>> existingLists;
     checkHash.TryGetValue(xorkey, out existingLists);

     if (existingLists != null)
     {
         // Key already in the dictionary: check each stored list.
         foreach (List<int> li in existingLists)
         {
             isIdentical = (theList.Count == li.Count);
             if (isIdentical)
             {
                 // Compare all elements (order-independent; values are distinct).
                 foreach (int v in theList)
                 {
                     if (!li.Contains(v))
                     {
                         isIdentical = false;
                         break;
                     }
                 }
             }
             if (isIdentical)
                 break;
         }
     }

     if (existingLists == null)
     {
         // Never seen this key before: start a new bucket.
         List<List<int>> newList = new List<List<int>>();
         newList.Add(theList);
         checkHash.Add(xorkey, newList);
     }
     else if (!isIdentical)
     {
         // XOR collision with different content: append to the existing bucket.
         existingLists.Add(theList);
     }

     return isIdentical;
 }

It is not the most elegant or easiest to read at a glance; it is more of a hack, and I am not even sure it performs better than the more elegant version from Guffa.
What it does, though, is take care of collisions in the XOR key by storing lists of List<int> in the dictionary.

If a duplicate key is found, we compare against each previously stored list until we either find an exact match or run out of candidates.

A good point for this code is that it should be about as fast as you can get in most cases, and still faster than building strings when a collision occurs.

+4

By implementing an IEqualityComparer for List<int>, you can use the list itself as a key in a dictionary.

If the lists are sorted, it could be so simple:

 class IntListEqualityComparer : IEqualityComparer<List<int>>
 {
     public int GetHashCode(List<int> list)
     {
         int code = 0;
         foreach (int value in list)
             code ^= value;
         return code;
     }

     public bool Equals(List<int> list1, List<int> list2)
     {
         if (list1.Count != list2.Count)
             return false;
         for (int i = 0; i < list1.Count; i++)
         {
             if (list1[i] != list2[i])
                 return false;
         }
         return true;
     }
 }

Now you can create a dictionary that uses IEqualityComparer:

 Dictionary<List<int>, YourClass> day1 = new Dictionary<List<int>, YourClass>(new IntListEqualityComparer()); 

Add all the items from the first day to the dictionary, then loop through the items from the second day and check whether the key exists in the dictionary. Since the IEqualityComparer handles both the hash code and the comparison, you will not get any false matches.
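That two-day pass might be sketched as follows. The comparer is repeated here (under a different name) so the example compiles on its own, and the mapping of sorted values list to key is an assumption about how the records are stored:

```csharp
using System;
using System.Collections.Generic;

// Comparer like the one above: XOR hash, element-by-element equality.
// Lists are assumed to be sorted.
class IntListComparer : IEqualityComparer<List<int>>
{
    public int GetHashCode(List<int> list)
    {
        int code = 0;
        foreach (int v in list) code ^= v;
        return code;
    }

    public bool Equals(List<int> a, List<int> b)
    {
        if (a.Count != b.Count) return false;
        for (int i = 0; i < a.Count; i++)
            if (a[i] != b[i]) return false;
        return true;
    }
}

static class TwoDayPass
{
    // day1 and day2 map each record's sorted values list to its key.
    // Returns (day1 key, day2 key) pairs with equal values but different keys.
    public static List<(int, int)> FindKeyChanges(
        Dictionary<List<int>, int> day1, Dictionary<List<int>, int> day2)
    {
        var result = new List<(int, int)>();
        foreach (var pair in day2)
            if (day1.TryGetValue(pair.Key, out int day1Key) && day1Key != pair.Value)
                result.Add((day1Key, pair.Value));
        return result;
    }
}
```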

You can try several different hash code calculations. The one in this example works, but may not give the best performance for your specific data. The only requirement for the hash code is that the same list always gets the same hash code, so you can do pretty much whatever you want to calculate it. The goal is to get as many different hash codes as possible for the keys in your dictionary, so that each bucket holds as few elements (with the same hash code) as possible.

+2

Is order significant? I.e., are [1,2] on day 1 and [2,1] on day 2 equal? If they are, then hashing may not work as well. Instead, you could use a sorted array/vector to help with the comparison.

Also, what values are there? Do they have a limited range (e.g. 0-63)? If so, perhaps you could combine them into one large integer (it may require more than 64 bits of precision) and hash that, instead of converting to a string, which can take some time.
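If the values really do fall in a small fixed range such as 0-63, the whole set packs into a single ulong bitmask. A sketch under that assumption (the class and method names are made up for illustration):

```csharp
using System;
using System.Collections.Generic;

static class BitmaskSignature
{
    // Packs a set of values in the range 0-63 into one 64-bit mask.
    // Order does not matter, so no sorting is needed, and equal sets
    // always produce equal masks (no collisions at all).
    public static ulong ToMask(IEnumerable<int> values)
    {
        ulong mask = 0;
        foreach (int v in values)
        {
            if (v < 0 || v > 63)
                throw new ArgumentOutOfRangeException(nameof(values));
            mask |= 1UL << v;
        }
        return mask;
    }
}
```

Unlike the XOR or sum hashes, two masks are equal exactly when the sets are equal, so no fallback comparison is needed, but only when the range assumption holds.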

0

It might be worth putting this in an SQL database. If you do not want a full-blown DBMS, you can use SQLite.

That would make uniqueness checks, joins, and these kinds of operations very simple and very efficient. It would also let you easily keep the data around if it is ever needed again.

0

Could you sum the list of values to get an integer that can be used as a preliminary check for whether another list contains the same set of values?

There will be many more collisions (the same sum does not necessarily mean the same set of values), but I think it could still eliminate most of the required comparisons up front.
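A sketch of the sum-as-prefilter idea (method names are invented here): only lists whose sums agree go on to a full set comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SumPrefilter
{
    // Cheap prefilter: different sums guarantee different sets,
    // but equal sums still require a full comparison.
    public static bool MightMatch(List<int> a, List<int> b) => a.Sum() == b.Sum();

    // Full order-independent check, run only when the prefilter passes.
    // Valid because the values within a list are distinct.
    public static bool SameSet(List<int> a, List<int> b) =>
        a.Count == b.Count && !a.Except(b).Any();
}
```

For example, { 1, 4 } and { 2, 3 } pass the prefilter (both sum to 5) but fail the full comparison, which is exactly the collision case described above.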

0
