Basic hash table algorithm: removing duplicates

Yesterday morning I had an interview, and they asked me the question "Give an algorithm to remove duplicates from a list of integers." This is a pretty standard question, so I was sure I could answer it.

I'm paraphrasing, but I said something along the lines of: "You can use a hash table. Start with the first integer and insert it into the hash table. Then, for each subsequent integer, look it up in the hash table to check whether it is already there; if it is not, insert it, and if it is already there, throw it away because it is a duplicate. Iterate through the list this way. If the hash table is designed correctly, lookups and inserts should be constant time on average."
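For concreteness, here is a minimal sketch of what I described, written in Python (my choice of language here; the interview question itself was language-independent). Python's set is backed by a hash table, so membership tests and inserts are expected O(1):

    def remove_duplicates(numbers):
        """Return numbers with duplicates removed, preserving first-seen order."""
        seen = set()              # hash table of integers seen so far
        result = []
        for n in numbers:
            if n not in seen:     # expected O(1) membership test
                seen.add(n)       # amortized O(1) insert
                result.append(n)
            # otherwise n is a duplicate and is discarded
        return result

    print(remove_duplicates([3, 1, 3, 2, 1]))  # prints [3, 1, 2]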

Then the interviewer replied (again, I'm paraphrasing), "But a hash table lookup is not constant time; it depends on how many elements are already in the table. The algorithm you described will be O(n^2)."

Then I answered, "Really? I thought that if you designed a good hash function, it would be constant time, making the whole thing O(n)."

Then the interviewer replied, "So you are saying that the lookup time will be the same for a hash table with many entries as for a hash table with few entries?"

Then I said: "Yes, if it is designed correctly."

Then the interviewer said, "That is not the case."

So I'm very confused right now. If someone can point out where I'm wrong, I would be very grateful.

1 answer

"If someone can point out where I'm wrong"

You are not mistaken at all: a correctly designed hash table gives expected O(1) lookups and amortized O(1) inserts, so your algorithm is O(N). Lookups in a heavily loaded hash table are indeed somewhat slower because of collision resolution, but the expected lookup time is still O(1). That may not be good enough for hard real-time systems, where amortized bounds don't count, but it is enough in practically every other situation.

Of course, you can always keep the elements you have seen in a balanced tree for a worst-case O(N*logN), or, if the numbers have reasonable bounds (for example, 0 to 100,000), you can use a boolean array to test membership in O(1) worst case, which may even beat the hash table thanks to a smaller constant factor. A sketch of that last alternative follows below.
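As an illustration, here is a rough sketch of the boolean-array version in Python, assuming the values are known to lie in 0..100,000 (the bound is only the example range above):

    def remove_duplicates_bounded(numbers, upper_bound=100_000):
        """Deduplicate integers known to lie in [0, upper_bound], preserving order."""
        seen = [False] * (upper_bound + 1)   # boolean array indexed by value
        result = []
        for n in numbers:
            if not seen[n]:                  # O(1) worst-case membership test
                seen[n] = True
                result.append(n)
        return result

    print(remove_duplicates_bounded([5, 5, 42, 7, 42]))  # prints [5, 42, 7]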
