If your elements are somehow comparable (whether the order has any real meaning is irrelevant; it just needs to be consistent with your definition of equality), the fastest way to remove duplicates is to sort the list (O(n log n)) and then do a single pass looking for duplicates, i.e. equal elements that follow each other (that pass is O(n)).
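A minimal Java sketch of this sort-then-scan approach (the `dedup` helper and its signature are my own illustration, not part of any standard API):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class SortDedup {
        // Removes duplicates by sorting a copy first (O(n log n)), then keeping
        // only elements that differ from the last kept element (O(n)).
        // 'cmp' must be consistent with the notion of equality being used.
        static <T> List<T> dedup(List<T> list, Comparator<? super T> cmp) {
            List<T> sorted = new ArrayList<>(list);
            sorted.sort(cmp);
            List<T> result = new ArrayList<>();
            for (T element : sorted) {
                // Equal elements are adjacent after sorting, so comparing with
                // the last kept element is enough to skip duplicates.
                if (result.isEmpty()
                        || cmp.compare(result.get(result.size() - 1), element) != 0) {
                    result.add(element);
                }
            }
            return result;
        }
    }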
The overall complexity is O(n log n), which is roughly the same as what you would get with a Set (also n times log(n)), but with a much smaller constant. The constant in sort/dedup comes from the cost of comparing elements, whereas the cost of the Set most likely comes from computing a hash plus one (possibly several) hash comparisons. This assumes you use a hash-based Set implementation; a tree-based one gives you O(n log²(n)), which is even worse.
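For contrast, here is a sketch of the hash-based Set check mentioned above, assuming the elements' `hashCode()`/`equals()` agree with the comparator's notion of equality (the helper name is illustrative):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class SetDedupCheck {
        // Set-based alternative: each add() costs a hash computation plus one
        // or more equality checks; add() returns false when the element is
        // already present, i.e. a duplicate exists.
        static <T> boolean hasDuplicate(List<T> list) {
            Set<T> seen = new HashSet<>(list.size() * 2);
            for (T element : list) {
                if (!seen.add(element)) {
                    return true; // duplicate found
                }
            }
            return false;
        }
    }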
However, as I understand it, you do not need to delete duplicates, but merely test for their existence. So you could hand-code a merge sort or heap sort over your array that simply bails out and returns true (i.e. "there is a duplicate") as soon as your comparator returns 0, and otherwise completes the sort and then traverses the sorted array checking for repeats. In a merge or heap sort, by the time the sort finishes you will indeed have compared every duplicate pair, unless both elements were already in their final positions (which is unlikely). So a tweaked sort algorithm like this should give a huge performance improvement (I would have to prove it, but I suspect the tweaked algorithm would be O(log n) on uniformly random data).
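A rough Java sketch of such a hand-rolled merge sort with an early exit on a zero comparison (the method names and structure are my own illustration of what the paragraph describes, not a reference implementation):

    import java.util.Arrays;
    import java.util.Comparator;

    public class DuplicateDetector {
        // Merge sort that aborts as soon as the comparator reports two equal
        // elements. If the sort completes without seeing a 0, a final pass over
        // the now-sorted array catches any remaining adjacent repeats.
        static <T> boolean containsDuplicate(T[] array, Comparator<? super T> cmp) {
            T[] work = Arrays.copyOf(array, array.length);
            T[] buffer = Arrays.copyOf(array, array.length);
            if (mergeSortFindsDup(work, buffer, 0, work.length, cmp)) {
                return true;
            }
            for (int i = 1; i < work.length; i++) {
                if (cmp.compare(work[i - 1], work[i]) == 0) {
                    return true;
                }
            }
            return false;
        }

        // Sorts work[from, to) using buffer as scratch space; returns true and
        // stops early the first time two elements compare as equal.
        private static <T> boolean mergeSortFindsDup(T[] work, T[] buffer, int from,
                                                     int to, Comparator<? super T> cmp) {
            if (to - from < 2) {
                return false;
            }
            int mid = (from + to) >>> 1;
            if (mergeSortFindsDup(work, buffer, from, mid, cmp)
                    || mergeSortFindsDup(work, buffer, mid, to, cmp)) {
                return true;
            }
            // Merge the two sorted halves into the buffer.
            int left = from, right = mid, out = from;
            while (left < mid && right < to) {
                int c = cmp.compare(work[left], work[right]);
                if (c == 0) {
                    return true; // duplicate detected mid-merge
                }
                buffer[out++] = (c < 0) ? work[left++] : work[right++];
            }
            while (left < mid) {
                buffer[out++] = work[left++];
            }
            while (right < to) {
                buffer[out++] = work[right++];
            }
            System.arraycopy(buffer, from, work, from, to - from);
            return false;
        }
    }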
Varkhan Feb 18 '09 at 21:57