I was asked about this in a recent interview.

I was asked to stay away from HashMap or any type of Hashing.

The question went something like this:

Suppose you have product identifiers up to 20 decimal digits in length, along with product descriptions. Without using maps or any hash function, what is the best / most efficient way to store and retrieve these product identifiers along with their descriptions?

Why would a map be a bad idea for such a scenario?

What changes would you make to sell your solution to Amazon?

+4
source share
14 answers

A map is a good choice when insert / delete / search operations alternate; each operation costs an amortized O(log n).

In your example, you are mostly doing search operations. You can assume that database updates (product insertion / deletion) are rare, so the interviewer probably wants you to find a data structure that is better suited for searches.

In this case, I see only a few options, as already suggested in other answers:

  • Sorted array (with binary search)
  • Hash map
  • Trie

With a trie, if the product identifiers share no common prefix, there is a good chance of finding a product description by looking only at the first character (or the first few characters). For example, suppose this list of product identifiers contains 125 products:

  • "1"
  • "2"
  • "3"
    ...
  • "123"
  • "124"
  • "1234567"

Suppose you look up the product identifier "1234567" in your trie: reading just the first characters "1", then "2", then "3", then "4" already leads to the right product description. There is no need to read the rest of the product identifier, since there are no other candidates left. With n the length of a product identifier, a search is O(n), and as the example above shows, it can be even faster. Since a product identifier is limited in size (20 characters), the height of the trie is limited to 20 levels. In practice this means you can treat search operations as constant time, since a search never descends past the trie's bounded height => O(1). Any BST lookup, by contrast, costs O(log N) at best, where N is the number of elements in your tree.

A hash map, meanwhile, may give a slower search, since you need to compute an index with a hash function, which most likely means reading the entire product identifier, plus traversing a bucket's list in case of collisions with other product identifiers.

With binary search in a sorted array, search performance depends on the number of items in your database: O(log N).
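A minimal sketch of such a digit trie (class and method names are mine, not from the thread): each node holds ten children, one per decimal digit, so no hashing is involved, and a lookup reads at most the 20 characters of the identifier.

```java
public class DigitTrie {
    private static class Node {
        Node[] child = new Node[10]; // one slot per decimal digit '0'..'9'
        String description;          // non-null only where an ID ends
    }

    private final Node root = new Node();

    public void put(String id, String description) {
        Node n = root;
        for (int i = 0; i < id.length(); i++) {
            int d = id.charAt(i) - '0';
            if (n.child[d] == null) n.child[d] = new Node();
            n = n.child[d];
        }
        n.description = description;
    }

    public String get(String id) {
        Node n = root;
        for (int i = 0; i < id.length() && n != null; i++) {
            n = n.child[id.charAt(i) - '0']; // walk one level per digit
        }
        return n == null ? null : n.description;
    }
}
```

Since identifiers are capped at 20 digits, `get` never touches more than 20 nodes, which is the O(1) bound the answer describes.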

+11
source

A B-Tree, in my opinion. Or is that still considered a map?

Mostly because you can load multiple items into memory at once, and searching those elements in memory is very fast.

+6
source

Sequential integers would be a perfect fit for a hash map, but it has one problem: it does not support multithreaded access by default. Also, since Amazon is mentioned in your question, you probably need to consider concurrency issues and RAM limitations.

What you can do with such a question is point out that you are not allowed any built-in data storage scheme, so all you can do is imitate one.

So, say you have M = 10^20 products with their numbers and descriptions. You can split this set into M / N containers of N items each, which greatly reduces the number of items per container. Applying this idea recursively gives you a way to store the entire set in containers with the property that access to any item stays fast.

To illustrate the idea, consider a smaller example of 20 elements. Imagine a file system with the directories "1", "2", "3", "4". In each directory, you store the product descriptions as files, like this:

folder 1: files 1 to 5
folder 2: files 6 to 10
...
folder 4: files 16 to 20

Then it takes only two steps to find a file. First, you find the right folder by dividing the identifier by 5 (your N). Then you use the identifier to read the product description stored in that file.

This is just a very rough description, but the idea is intuitive, so perhaps this is what your interviewer wanted to hear.
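The folder scheme can be sketched as a two-level array index; the bucket size of 5 and the 1-based identifiers match the example above, while the class and method names are made up:

```java
public class BucketIndex {
    private final int bucketSize;
    private final String[][] buckets; // buckets[folder][offset] -> description

    public BucketIndex(int capacity, int bucketSize) {
        this.bucketSize = bucketSize;
        int folders = (capacity + bucketSize - 1) / bucketSize; // round up
        this.buckets = new String[folders][bucketSize];
    }

    // Identifiers are 1-based, matching the folder/file example.
    public void put(int id, String description) {
        buckets[(id - 1) / bucketSize][(id - 1) % bucketSize] = description;
    }

    public String get(int id) {
        return buckets[(id - 1) / bucketSize][(id - 1) % bucketSize];
    }
}
```

Both `put` and `get` are exactly the two steps from the text: pick the folder by division, then the offset inside it.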

As for me, when I face questions like this in an interview, even if I cannot answer correctly (that is the worst case :)), I always try to draw the intended answer out of the interviewer.

+4
source

"Best / most efficient for what?" would be my answer.

E.g., to store them, you could quickly create two 20-element arrays: one for identifiers, one for descriptions. Iterating over them is pretty fast, and it is memory efficient.
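A sketch of the two-array idea, with made-up sample data:

```java
public class ParallelArrays {
    // Two parallel arrays: ids[i] goes with descriptions[i].
    static final String[] ids          = { "101", "102", "103" };
    static final String[] descriptions = { "toaster", "kettle", "blender" };

    static String find(String id) {
        for (int i = 0; i < ids.length; i++) {  // plain linear scan
            if (ids[i].equals(id)) return descriptions[i];
        }
        return null;                            // not found
    }
}
```

The scan is O(n), which is perfectly fine for 20 items and costs nothing in extra structure.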

Of course, the solution is practically useless for any real application, but then so is the question.

+2
source

There is an interesting alternative to B-Tree: Radix Tree

+1
source

I think they wanted you to do this, and I am not saying it is a good idea: use the computer's memory address space.

If you use 64-bit (virtual) memory addresses and assume you have the entire address space for your data (which never happens), you can store a single byte value at each address.

You could use the product ID as an address by casting it to a pointer, then fetch that byte, which could be an offset into another region of memory holding the actual data.

I would not do it this way, but perhaps this is the answer they were looking for.

Asaf

+1
source

I wonder if they wanted you to notice that in an e-commerce application (such as Amazon's), the usual use case is a "reverse search": getting the product identifier from the description. For that you use an inverted index, where each keyword in a description is an index key associated with a list of matching product identifiers. Binary trees or skip lists are good ways to index these keywords.
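A rough sketch of such an inverted index, using a TreeMap (a red-black tree, so it respects the "no hashing" constraint); the names and the whitespace tokenization are my own simplifications:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

public class InvertedIndex {
    // keyword -> product IDs whose description contains that keyword
    private final TreeMap<String, List<String>> index = new TreeMap<>();

    public void add(String productId, String description) {
        for (String word : description.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(word, k -> new ArrayList<>()).add(productId);
        }
    }

    public List<String> search(String keyword) {
        return index.getOrDefault(keyword.toLowerCase(), Collections.emptyList());
    }
}
```

Each keyword lookup is O(log k) in the number of distinct keywords, and the returned list is exactly the "list of corresponding product identifiers" the answer describes.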

Regarding the product identifier index: in practice, B-Trees (which are not binary search trees) would be used for a large on-disk index of 20-digit identifiers. However, they may have been looking for a toy solution that could be implemented in RAM. Since the "alphabet" of decimal digits is so small, it lends itself very well to tries.

+1
source

Hash maps work very well if the hash function gives you a very even distribution of hash values over the existing keys. With a very bad hash function, all 20 of your keys could end up with the same hash value, driving the search time up to O(n). Binary search, on the other hand, guarantees O(log n), but inserting data is more expensive.

It all scales: the larger your dataset, the less likely a bad key distribution becomes (if you use a good, proven hashing algorithm), and on small datasets the difference between O(n) and O(log n) is nothing to worry about.

0
source

20 decimal product identifiers with product description

A simple linear search would do very well...

I would create one simple array with the identifiers, and another array with the data.

A linear search over a small number of keys (only 20!) is much more efficient than any binary tree or hash.

0
source

If size is limited, it is sometimes faster to use a sorted list.

When you use hash-anything, you first need to compute the hash, then find the hash bucket, and then run equals against every item in the bucket. It all adds up.

Alternatively, you can use a simple ArrayList (or any other List flavor that suits the application), sort it with java.util.Collections.sort, and use java.util.Collections.binarySearch to find an item.
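A sketch of that approach, here using the array variants Arrays.sort / Arrays.binarySearch over id-description pairs (the sample data and names are made up):

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortedLookup {
    // Each row is { id, description }.
    static final String[][] products = {
        { "1234567", "toaster" }, { "124", "kettle" }, { "45", "blender" }
    };
    static final Comparator<String[]> BY_ID = Comparator.comparing(p -> p[0]);

    static {
        Arrays.sort(products, BY_ID); // sort once up front, O(n log n)
    }

    static String find(String id) { // O(log n) per lookup
        int pos = Arrays.binarySearch(products, new String[] { id, null }, BY_ID);
        return pos >= 0 ? products[pos][1] : null;
    }
}
```

Note that the identifiers compare as strings here; zero-padding them to a fixed 20 digits would make string order match numeric order.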

But, as Artem noted, perhaps a simple linear search will be much faster in this case.

On the other hand, for maintainability I would normally use a HashMap (or LinkedHashMap) here and only do something special when a profiler tells me to. Also, collections of 20 tend to become collections of 20,000 over time, and all that optimization would be wasted.

0
source

There is nothing wrong with hashing or B-trees for this kind of situation: your interviewer may just want you to think a little, rather than produce the expected answer. It is a good sign when interviewers want candidates to think; it shows that the organization values thought, not just regurgitating something from the CS0210 lecture notes.

By the way, I assume "20 decimal product identifiers" means "a large set of product identifiers whose format is 20 decimal digits"... because if there are only 20 of them, there is no point worrying about the algorithm: if you cannot use hashes or B-trees, code a linear search and move on. If you like, sort your array and use binary search.

But if my assumption is correct, then what the interviewer is asking seems to revolve around the time / space trade-offs of a hashmap. It is possible to improve on the hashmap's time / space curve: hash maps have collisions. So you could get some improvement by converting the 20 decimal digits to a number and using it as an index into a sparsely populated array... a really large array. :)
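The sparse-array idea can only be sketched in miniature, since a real 10^20-slot array cannot be allocated; this sketch assumes the identifiers happen to fall in a small range, as a stand-in for the "really large array":

```java
public class DirectIndex {
    // Stand-in for the sparsely populated array: a full 20-digit ID space
    // would need 10^20 slots, so this assumes IDs fit in a small range.
    static final String[] table = new String[100_000];

    static void put(long id, String description) { table[(int) id] = description; }
    static String get(long id)                   { return table[(int) id]; }
}
```

There are no collisions at all: each identifier owns its own slot, at the cost of space proportional to the ID range rather than the item count.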

Selling it to Amazon? Good luck with that. Anything you come up with would have to be patentable, and nothing in this discussion seems to rise to that level.

0
source

I have a feeling, based on their mention of product identifiers and the 20 digits, that they were looking for you to convert the numeric product identifiers to another base or a packed form.

They pointed out that the product description was associated with the product identifiers to tell you that a higher-base representation could fit in the current field's data type.

0
source

Your interviewer may be looking for a trie. If you have a [small] constant upper bound on your key length, then you have O(1) insertion and search.

0
source

I think they wanted you to do this, and I am not saying it is a good idea: use the computer's memory address space.

If you use 64-bit (virtual) memory addresses, and if you have the entire address space for your data (which never happens), you can store a single byte value at each address.

Unfortunately, 2^64 ≈ 1.8 × 10^19, slightly below 10^20. Coincidence?

log2(10^20) ≈ 66.43.

Here's a slightly wicked suggestion.

OK, so 2^64 bits can fit in one machine's address space.

Assume an estimate of N bytes per description, say N = 200 (who wants to download Anna Karenina when they are looking for toasters?). Commandeer 8 * N 64-bit machines with lots of RAM. Amazon can afford that.

Each machine loads one bit of the description text, across all descriptions, into its (very sparse) bitmap. Let the MMU / virtual memory handle the sparseness.

Distribute the product tag as a 59-bit address plus a one-byte bit selector. (59 = ceil(log2(10^20)) - 8)

Each machine returns one bit of the product description. A lookup is a virtual memory dereference. You can even insert and delete.
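The scheme can be simulated in miniature for a single description byte; the 8 bit-plane arrays below stand in for 8 machines (in the real proposal each plane would live in one machine's sparse virtual address space):

```java
public class BitSliceStore {
    // 8 "machines", each holding one bit-plane of a one-byte value per address.
    static final boolean[][] planes = new boolean[8][1 << 16];

    static void put(int addr, byte value) {
        for (int b = 0; b < 8; b++) {
            planes[b][addr] = ((value >> b) & 1) != 0; // machine b stores bit b
        }
    }

    static byte get(int addr) {
        int v = 0;
        for (int b = 0; b < 8; b++) {
            if (planes[b][addr]) v |= 1 << b;          // reassemble the byte
        }
        return (byte) v;
    }
}
```

A full description of N bytes would use 8 * N such planes, one per (byte, bit) position, exactly as the answer's machine count suggests.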

Of course, paging will become a bitch at some point!

Oddly enough, this works best if the product identifiers are as lumpy and un-hash-like as possible.

0
source
