Efficient algorithm for comparing similarity between sets of numbers?

I have a large number of sets of numbers. Each set contains 10 numbers, and I need to delete all sets that share 5 or more numbers (unordered) with any other set.

For instance:

 set 1: {12,14,222,998,1,89,43,22,7654,23}
 set 2: {44,23,64,76,987,3,2345,443,431,88}
 set 3: {998,22,7654,345,112,32,89,9842,31,23}

Given the 3 sets of 10 numbers above, sets 1 and 3 would be considered duplicates, since they share 5 numbers (998, 22, 7654, 89, 23). So, in this case, I would delete set 3 (because it is considered similar to set 1).

I have over 10,000 sets to compare, and I want to do this very efficiently. I have been turning it over in my head and just can't think of an efficient way to do this comparison (it would be great to do it in one pass).

Any ideas? Thanks!

Mike

+4
12 answers

You should rethink your requirements, because as stated, the operation does not even have a well-defined result. For example, take these sets:

 set 1: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
 set 2: {6, 7, 8, 9, 10, 11, 12, 13, 14, 15}
 set 3: {11, 12, 13, 14, 15, 16, 17, 18, 19, 20}

If you first consider sets 1 and 2 to be "duplicates" and eliminate set 1, then sets 2 and 3 are also "duplicates", and you are left with only one remaining set. But if you instead eliminate set 2 first, then sets 1 and 3 have no matches, and you are left with two sets.

You can easily extend this to the full 10,000 sets, so that depending on which sets you compare and eliminate first, you could end up with only one set or with 5,000 sets. I do not think that is what you want.

Mathematically speaking, your problem is that you are trying to find equivalence classes, but the similarity relation you use to define them is not an equivalence relation. In particular, it is not transitive. In layman's terms: if set A is "similar" to set B, and B is "similar" to set C, your definition does not guarantee that A is also "similar" to C, so you cannot meaningfully eliminate similar sets.

You need to clarify your requirements before worrying about an efficient implementation. Either find a way to define a transitive similarity, or keep all the sets and work only with the pairwise comparisons (or with a list of similar sets for each individual set).

+27

Another great job for a signature tree. Once again, I am stunned that there is no library that implements them. Let me know if you write one.

From the abstract of the first article in the search results above:

We propose a method that represents the given data as bitmap signatures and organizes them into a hierarchical index, suitable for similarity search and other related types of queries. In contrast to previous methods, the signature tree is dynamic and does not rely on hard-wired constants. Experiments with synthetic and real data sets show that it is robust to different data characteristics, scales with the database size, and is efficient for various queries.

+5

You do not say much about what range of numbers may appear, but I have two ideas:

  • an inverted index that maps each number to the lists that contain it; then intersect those posting lists to find the lists that have more than one number in common (see the sketch after this list).

  • divide the numbers into ranges of "close" numbers, then narrow down to the lists that have numbers appearing in the same ranges. For matching lists you shrink the ranges until you have a manageable number of lists, and then you can compare those lists exactly. The hard part is defining "closeness".
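For illustration, here is a minimal Java sketch of the inverted-index idea; the class name, the pair encoding, and the 5-match threshold are my own choices:

    import java.util.*;

    // Sketch of the inverted-index idea: map each number to the indices of the
    // sets that contain it, then count, for every pair of sets that shares at
    // least one number, how many numbers they share.
    public class InvertedIndexSketch {
        public static void main(String[] args) {
            int[][] sets = {
                {12, 14, 222, 998, 1, 89, 43, 22, 7654, 23},
                {44, 23, 64, 76, 987, 3, 2345, 443, 431, 88},
                {998, 22, 7654, 345, 112, 32, 89, 9842, 31, 23}
            };
            // number -> indices of the sets containing it
            Map<Integer, List<Integer>> index = new HashMap<>();
            for (int i = 0; i < sets.length; i++) {
                for (int n : sets[i]) {
                    index.computeIfAbsent(n, k -> new ArrayList<>()).add(i);
                }
            }
            // (i,j) pair packed into a long -> number of shared values; only
            // pairs that actually share some number are ever touched
            Map<Long, Integer> overlap = new HashMap<>();
            for (List<Integer> owners : index.values()) {
                for (int a = 0; a < owners.size(); a++) {
                    for (int b = a + 1; b < owners.size(); b++) {
                        long key = ((long) owners.get(a) << 32) | owners.get(b);
                        overlap.merge(key, 1, Integer::sum);
                    }
                }
            }
            overlap.forEach((key, count) -> {
                if (count >= 5) {
                    System.out.printf("sets %d and %d share %d numbers%n",
                            (int) (key >> 32), (int) (key & 0xFFFFFFFFL), count);
                }
            });
        }
    }

On the example data this prints that sets 0 and 2 share 5 numbers; how well it scales depends on how long the posting lists get.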

+2

I do not think there is a nice, elegant way to do this. Most of the other answers have you comparing most pairs x,y, which is O(N^2). You can do it faster.

Algorithm: keep an array of all 5-element subsets. For each new set, generate all its possible 5-element subsets and add them to this array. At the end, sort the array and check for duplicates.

There are C(10,5) = 10*9*8*7*6/120 = 252 subsets of length 5 of a set of length 10. So you store a table that is a few hundred times the size of your data, but perform only O(252*N) operations. This should work in practice, and I suspect it is the best theoretically.
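A sketch of this in Java (names are mine): each 5-subset is stored in sorted order so that equal subsets compare equal, and one global sort brings duplicates next to each other. In practice you would also record which input set each subset came from, so you know which set to delete:

    import java.util.*;

    public class FiveSubsetDuplicates {
        // Emit every 5-element subset of a sorted 10-element set.
        static void subsets(int[] set, int start, int[] pick, int depth,
                            List<int[]> out) {
            if (depth == 5) { out.add(pick.clone()); return; }
            for (int i = start; i <= set.length - (5 - depth); i++) {
                pick[depth] = set[i];
                subsets(set, i + 1, pick, depth + 1, out);
            }
        }

        public static void main(String[] args) {
            int[][] sets = {
                {12, 14, 222, 998, 1, 89, 43, 22, 7654, 23},
                {998, 22, 7654, 345, 112, 32, 89, 9842, 31, 23}
            };
            List<int[]> all = new ArrayList<>();   // 252 entries per input set
            for (int[] s : sets) {
                int[] sorted = s.clone();
                Arrays.sort(sorted);               // canonical order within a subset
                subsets(sorted, 0, new int[5], 0, all);
            }
            all.sort(Arrays::compare);             // duplicates become adjacent
            for (int i = 1; i < all.size(); i++) {
                if (Arrays.equals(all.get(i - 1), all.get(i))) {
                    System.out.println("shared 5-subset: "
                            + Arrays.toString(all.get(i)));
                }
            }
        }
    }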

+2

Since you need to compare all pairs of sets, the algorithm is about O(N^2), where N is the number of sets.

For each comparison you can do O(x + y), where x and y are the sizes of the two sets, in your case 10 each, so that is constant. But this requires that you sort all the sets in advance, which adds O(N * x lg x); again, x lg x is constant in your case.

The linear comparison algorithm for two sorted sets is quite simple: since the sets are now sorted, you can iterate over both sets at the same time. See C++ std::set_intersection for details.
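For reference, a small Java sketch of that linear merge, analogous to std::set_intersection but only counting the matches (both arrays are assumed to be sorted):

    public class SortedIntersection {
        // Count common elements of two sorted int arrays in O(x + y).
        static int countCommon(int[] a, int[] b) {
            int i = 0, j = 0, common = 0;
            while (i < a.length && j < b.length) {
                if (a[i] < b[j]) i++;
                else if (a[i] > b[j]) j++;
                else { common++; i++; j++; }   // equal: one shared number
            }
            return common;
        }

        public static void main(String[] args) {
            int[] a = {1, 22, 23, 89, 998, 7654};
            int[] b = {22, 23, 31, 89, 998, 7654};
            System.out.println(countCommon(a, b)); // prints 5
        }
    }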

The whole algorithm is O(N^2), which will be pretty slow for 10,000 sets.

+1

You could compute the Pearson coefficient between two data sets. This method would make your program easily scalable to huge data sets.
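If you go this route, here is a minimal sketch of the Pearson coefficient for two equal-length arrays; note that it assumes a fixed pairing of elements, which you would first have to define for unordered sets:

    public class Pearson {
        // Pearson correlation r = cov(x, y) / (sd(x) * sd(y)) for paired samples.
        static double pearson(double[] x, double[] y) {
            int n = x.length;
            double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                sx += x[i]; sy += y[i];
                sxx += x[i] * x[i]; syy += y[i] * y[i];
                sxy += x[i] * y[i];
            }
            double cov = sxy - sx * sy / n;   // n * covariance
            double vx = sxx - sx * sx / n;    // n * variance of x
            double vy = syy - sy * sy / n;    // n * variance of y
            return cov / Math.sqrt(vx * vy);
        }

        public static void main(String[] args) {
            double[] a = {1, 2, 3, 4, 5};
            double[] b = {2, 4, 6, 8, 10};
            System.out.println(pearson(a, b)); // 1.0: perfectly correlated
        }
    }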

+1

There is a way to do this with high time efficiency but extremely low space efficiency.

If my math is correct, each combination of 5 numbers from a set of 10 gives 10!/((10-5)!*5!) = 252 combinations, multiplied by 10,000 sets = 2.52 million combinations. A set of 5 integers consumes 20 bytes, so you can put every combination for every set into a HashSet and use only about 50 megabytes (plus overhead that blows that up by a factor of at least 2-3).

Now this may seem expensive but, compared to the alternative of testing a new set of 10 against the existing 10,000 individually, computing its 252 sets of 5 and checking whether any of them is already in the hash set should come out ahead.

Basically:

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;

    public class SetOf5 {
        private final Set<Integer> numbers;
        private final int hashCode;

        public SetOf5(int... numbers) {
            if (numbers.length != 5) {
                throw new IllegalArgumentException();
            }
            Set<Integer> set = new HashSet<Integer>();
            for (int i : numbers) {
                set.add(i);
            }
            this.numbers = Collections.unmodifiableSet(set);
            // precalculated from the set itself, so it is order-independent
            this.hashCode = this.numbers.hashCode();
        }

        // other constructors for passing in, say, an array of 5 or a Collection of 5

        // this is precalculated because it will be called a lot
        @Override
        public int hashCode() {
            return hashCode;
        }

        @Override
        public boolean equals(Object ob) {
            if (!(ob instanceof SetOf5)) return false;
            SetOf5 setOf5 = (SetOf5) ob;
            return numbers.containsAll(setOf5.numbers);
        }
    }

You just need to do two things:

  • Create a HashSet<SetOf5> for all existing tuples of 5; and
  • Create an algorithm to generate all possible sets of 5 (see the sketch after the next paragraph).

Then your algorithm becomes: for each set of 10 numbers, create all possible sets of 5 and check each one against the "set of sets". If one is already there, reject the set of 10; otherwise, add its sets of 5 to the "set of sets" and continue.
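A sketch of the missing generator (the class and method names are mine), enumerating the 252 subsets via 10-bit masks with exactly five bits set, and reusing the SetOf5 class above:

    import java.util.ArrayList;
    import java.util.List;

    public class SubsetGenerator {
        // All C(10,5) = 252 five-element subsets of a ten-element array,
        // for insertion into (or lookup against) the HashSet<SetOf5>.
        static List<SetOf5> allSetsOf5(int[] ten) {
            List<SetOf5> result = new ArrayList<>(252);
            for (int mask = 0; mask < (1 << 10); mask++) {
                if (Integer.bitCount(mask) != 5) continue;  // keep 5-bit masks
                int[] pick = new int[5];
                for (int bit = 0, k = 0; bit < 10; bit++) {
                    if ((mask & (1 << bit)) != 0) pick[k++] = ten[bit];
                }
                result.add(new SetOf5(pick));
            }
            return result;
        }
    }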

I think you will find this much cheaper, at least for the case of 5 numbers out of 10, than a brute-force comparison of 10,000 sets against each other.

+1

Maybe you need an algorithm like this (as I understand your problem)?

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Set;

    /**
     * @author karnokd, 2009.06.28.
     * @version $Revision 1.0$
     */
    public class NoOverlappingSets {
        // because of the shortcomings of java type inference, O(N)
        public static Set<Integer> setOf(Integer... values) {
            return new HashSet<Integer>(Arrays.asList(values));
        }

        // the test function, O(N)
        public static boolean isNumberOfDuplicatesAboveLimit(
                Set<Integer> first, Set<Integer> second, int limit) {
            int result = 0;
            for (Integer i : first) {
                if (second.contains(i)) {
                    result++;
                    if (result >= limit) {
                        return true;
                    }
                }
            }
            return false;
        }

        public static void main(String[] args) {
            List<Set<Integer>> sets = new LinkedList<Set<Integer>>() {{
                add(setOf(12, 14, 222, 998, 1, 89, 43, 22, 7654, 23));
                add(setOf(44, 23, 64, 76, 987, 3, 2345, 443, 431, 88));
                add(setOf(998, 22, 7654, 345, 112, 32, 89, 9842, 31, 23));
            }};
            List<Set<Integer>> resultset = new LinkedList<Set<Integer>>();
            loop:
            for (Set<Integer> curr : sets) {
                for (Set<Integer> existing : resultset) {
                    if (isNumberOfDuplicatesAboveLimit(curr, existing, 5)) {
                        continue loop;
                    }
                }
                // no overlap with any previously kept set
                resultset.add(curr);
            }
            System.out.println(resultset);
        }
    }

I am not an expert in big-O notation, but I think this algorithm is O(N*M^2), where N is the number of elements per set and M is the total number of sets (based on the loops I used in the algorithm). I took the liberty of defining what I consider overlapping sets.

I think your problem is polynomial. As I recall from my lectures, the decision version would be NP-hard, but correct me if I am wrong.

0

This is a simple problem because your sets are limited to size ten. For every set of ten numbers there are fewer than 1000 subsets that contain at least five numbers. Choose a hash function that hashes integer sequences into, say, 32-bit numbers. For every set of ten integers, compute the value of this hash function for each subset with five or more elements. This gives fewer than 1000 hash values per set of ten numbers. Insert a pointer to the set of ten integers into a hash table under all of these keys. Once you have done this, your hash table has about 1000 * 10,000 = 10 million entries, which is quite doable; and this first pass is linear (O(n)), since the size of an individual set is bounded by 10.

In a second pass, iterate over all the hash values in any order. Whenever more than one set is associated with the same hash value, they most likely contain a common subset of at least five integers. Verify this, then delete one of the sets and its corresponding entries in the hash table. Continue through the hash table. This is also an O(n) step.

Finally, suppose you do this in C. Here is a routine that computes the hash values for a single set of ten integers. It assumes the integers are in ascending order:

    static int hash_index;

    void hrec(int *myset, unsigned int *hash_values,
              unsigned int h, int idx, int card);

    void calculate_hash(int *myset, unsigned int *hash_values)
    {
        hash_index = 0;
        hrec(myset, hash_values, 0, 0, 0);
    }

    void hrec(int *myset, unsigned int *hash_values,
              unsigned int h, int idx, int card)
    {
        if (idx == 10) {
            if (card >= 5) {
                hash_values[hash_index++] = h;
            }
            return;
        }
        unsigned int hp = h;
        hp += (myset[idx]) + 0xf0f0f0f0;
        hp += (hp << 13) | (hp >> 19);
        hp *= 0x7777;
        hp += (hp << 13) | (hp >> 19);
        hrec(myset, hash_values, hp, idx + 1, card + 1);  /* include myset[idx] */
        hrec(myset, hash_values, h, idx + 1, card);       /* exclude myset[idx] */
    }

This recurses through all 1024 subsets and stores the hash values for subsets of cardinality 5 or more in the hash_values array. At the end, hash_index counts the number of valid entries. It is of course a constant, namely C(10,5) + C(10,6) + ... + C(10,10) = 638.

0

We take the data set, adorn each element with a signature, and sort it. The signature has the property that sorting groups together those elements that could have duplicates. When comparing data_set[j] with the elements in data_set[j+1...], as soon as the first signature in [j+1...] fails the duplicate check, we can move on. This "adjacency criterion" ensures we do not need to look any further; no element beyond that point can be a duplicate.

This greatly reduces the O(N^2) comparison count. I will leave the exact bound to the algorithm analysts, but the code below performs ~400k comparisons instead of the 100M of the naive O(N^2) approach.

The signature starts with bucketing the elements. Divide the range of numbers into N buckets of equal size: 1..k, k+1..2k, 2k+1..3k, ... While iterating over an element's numbers, increment the counter of the bucket each number falls into. This yields an initial signature of the form (0,0,0,1,3,0,0,...,4,2).

A signature has the property that if

 sum(min(sig_a[i], sig_b[i]) for i in range(10)) >= 5 

then it is possible that the elements behind the signatures share at least 5 numbers. More importantly, if the above does not hold, the elements cannot share 5 numbers. For example, with sig_a starting (2,3,0,...) and sig_b starting (1,4,2,...), the first three buckets contribute min(2,1) + min(3,4) + min(0,2) = 4 to the sum. Let's call this the "signature matching criterion".

However, sorting by the above signature does not give the adjacency property mentioned above. But if we reduce the signature to two elements:

 (sum(sig[:-1]), sig[-1]) 

then the "signature matching criteria" is met. But is adjacency criteria satisfied? Yes. The sum of this signature is 10. If we list, we have the following possible signatures:

 (0,10) (1, 9) (2, 8) (3, 7) (4, 6) (5, 5) (6, 4) (7, 3) (8, 2) (9, 1) (10,0) 

If we compare (0,10) against (1,9) .. (10,0), we note that once the signature test fails, it never becomes true again: that is the adjacency criterion. Moreover, this adjacency criterion is satisfied for any positive threshold, not just "5".

Well, that's nice, but collapsing the signature into two large buckets will not necessarily reduce the O(N^2) search; the signature is too general. We solve that problem by building a signature for sig[:-1] in turn, producing

 (sum(sig[:-1]), sig[-1]), (sum(sig[:-2]), sig[-2]), ... 

and so on. I believe this signature still satisfies adjacency, but I could be wrong.

There are some optimizations I did not do: the signature only needs the last value of each tuple, not the first, but then the sorting step would have to be revisited. Also, the signature comparison could bail out early once it becomes clear that further scanning cannot succeed.

    # python 3.0
    import random

    # M number of elements, N size of each element
    M = 10000
    N = 10

    # Bounds on the size of an element of each set
    Vmin, Vmax = 0, (1 << 12)

    # DupCount is number of identical numbers required for a duplicate
    DupCount = 5

    # R random number generator, same sequence each time through
    R = random.Random()
    R.seed(42)

    # Create a data set of roughly the correct size
    data_set = [list(s) for s in
                (set(R.randint(Vmin, Vmax) for n in range(N)) for m in range(M))
                if len(s) == N]

    # Adorn the data_set with signatures and sort
    def signature(element, width, n):
        "Return a signature for the element"
        def pearl(l, s):
            def accrete(l, s, last, out):
                if last == 0:
                    return out
                r = l[last]
                return accrete(l, s - r, last - 1, out + [(s - r, r)])
            return accrete(l, s, len(l) - 1, [])
        l = (n + 1) * [0]
        for i in element:
            l[i // width] += 1
        return pearl(l, len(element))

    # O(n lg(n)) - with only 10k elements, lg(n) is a little over 13
    adorned_data_set = sorted(
        [signature(element, (Vmax - Vmin + 1) // 12, 12), element]
        for element in data_set)

    # Count the number of possible intersections
    def compare_signatures(sig_a, sig_b, n=DupCount):
        "Return true if the signatures are compatible"
        for ((head_a, tail_a), (head_b, tail_b)) in zip(sig_a, sig_b):
            n -= min(tail_a, tail_b)
            if n <= 0:
                return True
        return False

    k = n = 0
    for i, (sig_a, element_a) in enumerate(adorned_data_set):
        if not element_a:
            continue
        for j in range(i + 1, len(adorned_data_set)):
            sig_b, element_b = adorned_data_set[j]
            if not element_b:
                continue
            k += 1
            if compare_signatures(sig_a, sig_b):
                # here element_a and element_b would be compared for equality
                # and the duplicate removed by adorned_data_set[j][1] = []
                n += 1
            else:
                break

    print("maximum of %d out of %d comparisons required" % (n, k))
0

Suppose you have a NumberSet class that implements your unordered set (and can enumerate its numbers as ints). Then you will need the following data structures and algorithm:

  • Map<int, Set<NumberSet>> numberSets
  • Map<Pair<NumberSet, NumberSet>, int> matchCount
  • Pair<X,Y> is a key object that has the same hash code and equality for any two instances with the same X and Y, regardless of their order (a sketch follows below)
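A minimal sketch of such an order-insensitive pair key (a hypothetical implementation, intended for same-typed X and Y such as two NumberSets):

    // Order-insensitive pair key: new Pair<>(a, b) and new Pair<>(b, a)
    // produce the same hashCode and compare equal.
    public class Pair<X, Y> {
        private final X x;
        private final Y y;

        public Pair(X x, Y y) { this.x = x; this.y = y; }

        @Override
        public int hashCode() {
            return x.hashCode() + y.hashCode();   // symmetric by construction
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Pair)) return false;
            Pair<?, ?> p = (Pair<?, ?>) o;
            return (x.equals(p.x) && y.equals(p.y))
                || (x.equals(p.y) && y.equals(p.x));
        }
    }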

Now, for each set that is added/compared, do the following (pseudo-code!):

    for (int number : setToAdd) {
        Set<NumberSet> numbers = numberSets.get(number);
        if (numbers == null) {
            numbers = new HashSet<NumberSet>();
            numberSets.put(number, numbers);
        } else {
            for (NumberSet numberSet : numbers) {
                Pair<NumberSet, NumberSet> pairKey =
                    new Pair<NumberSet, NumberSet>(numberSet, setToAdd);
                // make sure to handle null as 0 here in real code ;)
                matchCount.put(pairKey, matchCount.get(pairKey) + 1);
            }
        }
        numbers.add(setToAdd);  // register the new set under this number
    }

At any time you can walk over the pairs; any pair whose count is 5 or more indicates a duplicate.

Note: deleting sets may be a bad idea, because if A is a duplicate of B, and B is a duplicate of C, then C is not necessarily a duplicate of A. So if you delete B, you would not delete C, and the order in which you add your sets becomes important.

-1

It sounds like you want to use the HashSet class. This gives you O(1) lookup time, which should make for a very efficient comparison if you get your loops right. (I am not discussing the algorithm here, just suggesting a data structure in case it helps.)

-2
