We take a dataset, decorate each element with a signature and sort it. A signature has the property of sorting a group of these elements together, which may have duplicates. when comparing data_set [j] with the elements in data_set [j + 1 ...], when the first signature in [j + 1 ...] the duplicate check failed, we go ahead. This “adjacency criterion” ensures that we do not need to look further; no item outside this can be a duplicate.
This greatly reduces the comparison of O (N ^ 2). As far as I allow, the analyst of the algorithm solves, but the code below is ~ 400k comparisons instead of 100 m naive O (N ^ 2).
The signature begins with a breakdown of the elements. Divide the range of numbers in N cones of equal size: 1..k, k + 1..2k, 2k + 1..3k, ... When iterating over the elements, we increase the counter if the number falls into a special bucket. This gives the initial signature form (0,0,0,1,3,0,0, ... 4,2).
A signature has the property that if
sum(min(sig_a[i], sig_b[i]) for i in range(10)) >= 5
it is possible that signature-related items have at least 5 duplicates. But more, if this is not done above, then the elements cannot have 5 duplicates. Lets call it "signature matching criteria."
But, sorting by the above signature does not have the adjacency property mentioned above. However, if we modify the signature two elements:
(sum(sig[:-1]), sig[-1])
then the "signature matching criteria" is met. But is adjacency criteria satisfied? Yes. The sum of this signature is 10. If we list, we have the following possible signatures:
(0,10) (1, 9) (2, 8) (3, 7) (4, 6) (5, 5) (6, 4) (7, 3) (8, 2) (9, 1) (10,0)
If we compare (0.10) with (1.9) .. (10.0), we note that as soon as the signature test fails, it will never become true again. adjacency criterion. In addition, this adjacency criterion is satisfied for all positive values, not just "5".
Well, that's nice, but dividing the signature into two large buckets will not necessarily reduce the search for O (N ^ 2); signature overly general. We solve this problem by creating a signature for sig [: - 1], producing
(sum(sig[:-1]), sig[-1]), (sum(sig[:-2]), sig[-2]), ...
etc. I believe that this signature still satisfies the adjacency, but I could be wrong.
There are some optimizations that I did not do: the signature requires only the last value of each tuple, and not the first, but the sorting step should be reviewed. In addition, the signature comparison can be optimized with an early failure when it becomes clear that further scanning cannot be successful.
# python 3.0 import random