Best duplicate delete algorithm in a string array

Today at school, the teacher asked us to introduce a duplication-removal algorithm. This is not so difficult, and everyone came up with the following solution (pseudocode):

 for i from 1 to n - 1
     for j from i + 1 to n
         if v[i] == v[j] then
             remove(v, v[j])   // remove(from, what)
         end if
     next j
 next i

The computational complexity of this algorithm is n(n-1)/2. (We are in high school and have not covered big-O yet, but it seems to be O(n^2).) This solution seems ugly and is of course slow, so I tried to come up with something faster:

 procedure binarySearch(vector, element, *position)
     // this procedure searches for element in vector, returning
     // true if found, false otherwise. *position will contain the
     // element's place (where it is or where it should be)
 end procedure

 ----

 vS = new array[n]   // same type as v
 for i from 1 to n - 1
     if binarySearch(vS, v[i], &p) = true then
         remove(v, v[i])
     else
         add(vS, v[i], p)   // adds v[i] at position p of array vS
     end if
 next i

This way vS contains all the elements we have already passed. If the element v[i] is already in this array, it is a duplicate and gets deleted. The computational complexity of the binary search is log(n), and the main loop (the second fragment) runs n times, so the total CC is n*log(n), if I am not mistaken.
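For concreteness, here is a sketch of this idea in Java (class and variable names are mine, not from the question; `Collections.binarySearch` returns `-(insertionPoint) - 1` when the element is absent, which plays the role of `*position`):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedSeenDedup {
    // Deduplicate by keeping a sorted "seen" list (vS in the pseudocode)
    // and binary-searching it for each element.
    static List<String> dedup(List<String> v) {
        List<String> seen = new ArrayList<>();    // kept sorted at all times
        List<String> result = new ArrayList<>();
        for (String s : v) {
            int p = Collections.binarySearch(seen, s);  // O(log n) search
            if (p < 0) {
                // not seen yet; -p - 1 is the insertion point,
                // but note the insertion itself shifts elements
                seen.add(-p - 1, s);
                result.add(s);
            }
            // else: duplicate, skip it
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(dedup(List.of("b", "a", "b", "c", "a")));  // [b, a, c]
    }
}
```

Note that `seen.add(-p - 1, s)` has to shift the tail of the array, so the insertion step is not free even though the search is logarithmic.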

Then I had a different idea involving a binary tree, but I have not been able to work it out yet.
Mostly my questions:

  • Is my CC calculation correct? (and if not, why?)
  • Is there a faster way to do this?

thanks

+8
string algorithm complexity-theory big-o duplicates
5 answers

The simplest solution is to simply sort the array, which takes O(n log n) with a standard library implementation, if you are allowed to use one. Otherwise, consider writing a simple randomized quicksort (the code is even on Wikipedia).

Then scan it once more. During this scan, simply eliminate consecutive identical items.
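The sort-then-scan idea might look like this in Java (a sketch; the helper name is mine, and note that sorting discards the original order):

```java
import java.util.Arrays;

public class SortDedup {
    // Sort first (O(n log n)), then drop consecutive equal items in one O(n) pass.
    static String[] dedup(String[] v) {
        String[] a = v.clone();
        Arrays.sort(a);                          // duplicates become adjacent
        int k = 0;                               // length of the deduplicated prefix
        for (int i = 0; i < a.length; i++) {
            if (k == 0 || !a[k - 1].equals(a[i])) {
                a[k++] = a[i];                   // keep the first item of each run
            }
        }
        return Arrays.copyOf(a, k);
    }

    public static void main(String[] args) {
        String[] out = dedup(new String[]{"b", "a", "b", "c", "a"});
        System.out.println(Arrays.toString(out));  // [a, b, c] -- original order is lost
    }
}
```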

If you want to do this in O(n), you can instead use a HashSet of the elements you have already seen. Iterate once over your array and, for each element, check whether it is in the HashSet.

If it is not there, add it. If it is there, remove it from the array.

Note that this takes some extra memory, and hashing has a constant factor that adds to your running time. Although the time complexity is better, the practical running time will only be faster once you exceed a certain array size.
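A minimal sketch of the HashSet approach (building a new list instead of removing in place, which is simpler in Java):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashDedup {
    // One pass: keep an element only if it has not been seen before.
    static List<String> dedup(List<String> v) {
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String s : v) {
            if (seen.add(s)) {       // add() returns false if s was already present
                result.add(s);
            }
        }
        return result;               // first-occurrence order, expected O(n) time
    }

    public static void main(String[] args) {
        System.out.println(dedup(List.of("b", "a", "b", "c", "a")));  // [b, a, c]
    }
}
```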

+13

You can often exploit the space-time tradeoff and invest more space to reduce time.

In this case, you can use a hash table to identify unique words.

+6

add is O(n) (inserting at position p has to shift every element after it), so your CC calculation is incorrect. Your algorithm is O(n^2).

Also, how would you implement remove? It looks like it will also be O(n), so the original algorithm is O(n^3).

+2

Binary search only works if the array you are searching is sorted. I assume that is not the case here, or you would not need to iterate over the entire array in the inner loop of your original solution.

0

If the order of the final result does not matter, you can split the array into smaller arrays based on the lengths of the strings, and then remove the duplicates from those arrays. Example:

 // You have {"a", "ab", "b", "ab", "a", "c", "cd", "cd"}.
 // Break it into {"a", "b", "a", "c"} and {"ab", "ab", "cd", "cd"},
 // remove duplicates from those arrays using the methods that others have mentioned,
 // and then combine the arrays back together into {"a", "b", "c", "ab", "cd"}.
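A sketch of this bucketing idea in Java (using a set per bucket for the per-bucket deduplication rather than the sort-and-merge method; names are mine):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class LengthBucketDedup {
    // Partition by string length, deduplicate each bucket, then concatenate
    // the buckets. The result groups strings by length, so this is only
    // usable when the order of the final result does not matter.
    static List<String> dedup(List<String> v) {
        Map<Integer, Set<String>> buckets = new TreeMap<>();  // keyed by length
        for (String s : v) {
            buckets.computeIfAbsent(s.length(), k -> new LinkedHashSet<>()).add(s);
        }
        List<String> result = new ArrayList<>();
        for (Set<String> bucket : buckets.values()) {
            result.addAll(bucket);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> in = List.of("a", "ab", "b", "ab", "a", "c", "cd", "cd");
        System.out.println(dedup(in));  // [a, b, c, ab, cd]
    }
}
```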
0
