Open addressing against individual chains

Question

Open addressing against individual chains

What is the best way to handle collisions with a hashmap if the load factor is close to 1 to ensure minimal memory wear?

I personally believe that the answer is open addressing with linear sensing, because in the event of a collision it does not require additional storage space. Is it correct?

+7

hashtable hashmap hash-collision

user191776 Oct 30 '10 at 14:23

source share

3 answers

A hashmap that is complete degrades into a linear search, so you want to keep them at 90%.

You are right about open addressing using less memory; for a chain, each node field requires a pointer or offset field.

I created a hasharray data structure when I need very lightweight hash tables that don't have many inserts. To keep memory usage low, all data is embedded in the same memory block, first with a HashArray structure, then two arrays for hashes and values. Hasharray can only be used with a search key that is stored in a value.

typedef uint16_t HashType; /* this can be 32bits if needed. */ typedef uint16_t HashSize; /* this can be made 32bits if large hasharrays are needed. */ struct HashArray { HashSize length; /* hasharray length. */ HashSize count; /* number of hash/values pairs contained in the hasharray. */ uint16_t value_size; /* size of each value. (maximum size of value 64Kbytes) */ /* these last two fields are just for show, they are not defined in the HashArray struct. */ uint16_t hashs[length]; /* array of hashs for each value, this helps with resolving bucket collision */ uint8_t values[length * value_size]; /* array holding all values. */ }; #define hasharray_get_hashs(array) (HashType *)(((uint8_t *)(array)) + sizeof(HashArray)) #define hasharray_get_values(array) ((uint8_t *)(array)) + sizeof(HashArray) + \ ((array)->length * sizeof(HashType)) #define hasharray_get_value(array, idx) (hasharray_get_values(array) + ((idx) * (array)->value_size))

The hasharray_get_hashs and hasharray_get_values macros are used to get the hashs and values arrays.

I used this to quickly search for complex objects that are already stored in an array. Objects have a "name" field that is used for searching. Names are hashed and inserted into the hasharray with an index of objects. The values stored in hasharray can be indexes / pointers / integer objects (I use only small 16-bit index values).

If you want to pack a hashray until it is fully populated, you will want to use full 32-bit hashes instead of the 16-bit ones defined above. Large 32-bit hashes will facilitate a quick search when the hash is more than 90% full.

A hashhare, as defined above, can contain a maximum of 65535, which is great, since I never use it on anything that has more than a few hundred values. All that requires more, which should just use a regular hash table. But if memory is really a problem, the HashSize type can be changed to 32 bits. In addition, I use power-of-2 lengths to quickly find hash searches. Some people prefer to use the length of the main bucket, but this is only necessary if the hash function has a poor distribution.

 #define hasharray_empty_hash 0xFFFF /* hash value to mark empty slots. */ void *hasharray_search(HashArray *array, HashType hash, uint32_t *next) { HashType *hashs = hasharray_get_hashs(array); uint32_t mask = array->length - 1; uint32_t start_idx; uint32_t idx; hash = (hash == hasharray_empty_hash) ? 0 : hash; /* need one hash value to mark empty slots. */ start_hash_idx = (hash & mask); if(*next == 0) { idx = start_idx; /* new search. */ } else { idx = *next & mask; /* continuing search to next slot. */ } /* find hash in hash array. */ do { /* check for hash match. */ if(hashs[idx] == hash) goto found_hash; /* check for end of chain. */ if(hashs[idx] == hasharray_empty_hash) break; idx++; idx &= mask; } while(idx != start_idx); /* maximum tries reached (ie did a linear search of whole array) or end of chain. */ return NULL; found_hash: *next = idx + 1; /* where to continue search at, if this is not the right value. */ return hasharray_get_values(array) + (idx * array->value_size); }

a hash collision will occur, so the code that calls hasharray_search () should compare the search key with the one stored in the value object. If they do not match, then hasharray_search () is called again. Non-unique keys may also exist, as the search can continue until a “NULL” value is returned to find all values matching a single key. The search function uses linear sensing for regular caching.

 typedef struct { char *name; /* this is the lookup key. */ char *type; /* other field info... */ } Field; typedef struct { Field *list; /* array of Field objects. */ HashArray *lookup; /* hasharray for fast lookup of Field objects by name. The values stored in this hasharray are 16bit indices. */ uint32_t field_count; /* number of Field objects in 'list'. */ } Fields; extern Fields *fields_new(uint16_t count) { Fields *fields; fields = calloc(1, sizeof(Fields)); fields->list = calloc(count, sizeof(Field)); /* allocate hasharray to hold at most 'count' uint16_t values. * The hasharray will round 'count' up to the next power-of-2. * That power-of-2 length must be atleast (count+1), so that there will always be one empty slot. */ fields->lookup = hasharray_new(count, sizeof(uint16_t)); fields->field_count = count; } extern Field *fields_lookup_by_name(Fields *fields, const char *name) { HashType hash = str_to_hash(name); Field *field; uint32_t next = 0; uint16_t *rc; uint16_t idx; do { rc = hasharray_search(fields->lookup, hash, &next); if(rc == NULL) break; /* field not found. */ /* found a possible match. */ idx = *rc; assert(idx < fields->field_count); field = &(fields->list[idx]); /* compare lookup name with field name. */ if(strcmp(name, field->name) == 0) { /* found match. */ return field; } /* field didn't match continue search to next field. */ } while(1); return NULL; }

The worst search will be distorted before the linear search of the entire array, if it is 99% full and the key does not exist. If the keys are integers, then the linear search should not be bad, you also need to compare only the keys with the same value of the hash function. I try to have hasharrays so that they are only 70-80%, the space spent on empty slots is not so much if the values have only 16-bit values. With this design, you only use 4 bytes per empty slot when using 16-bit hashes and 16-bit index values. An array of objects (Field Structures in the above example) has no empty spaces.

Also, most hash table implementations I've seen do not store computed hashes and require the full key to be compared to resolve conflicts in buckets. Comparing hashes helps a lot since only a small portion of the hash value is used to find the bucket.

+1

Neopallium Nov 01 '10 at 9:22

source share

As others said, with linear sensing, when the load factor is about 1, the time complexity is close to linear search. (When it is full, its infinite.) There is a trade in the utility of memory. While a divided chain always gives us a theoretically constant time.

Typically, with linear sensing, it is recommended that the load factor be kept between 1/8 and 1/2. when the array is 1/2, we resize it to double the size of the original array. (Ref: Algorithms. Robert Sedgewick. Kevin Wayne.). When deleting, we resize the array to 1/2 of the original size. If you are really interested, it will be useful for you to start with the book mentioned above. In practice, he said 0.72 is an empirical value that we usually use.

0

Lowson pan Jun 21 '16 at 6:24

source share

user166390 · Accepted Answer · 2010-11-02T06:47:59+0000

Answering the question: what is the best way to handle collisions with a hashmap, when the load factor is close to 1 k , provides minimal memory wear?

Open addressing / sounding, which allows to obtain a high level of filling. . Because, as you yourself said, collisions do not require additional space (just, well, maybe time - of course, this also assumes the hash function is not ideal).

If you did not specify a "load factor close to 1" or include a "cost" in the question, then that would be completely different.

Happy coding.

Open addressing against individual chains

More articles: