A hashmap that is complete degrades into a linear search, so you want to keep them at 90%.
You are right about open addressing using less memory; for a chain, each node field requires a pointer or offset field.
I created a hasharray data structure when I need very lightweight hash tables that don't have many inserts. To keep memory usage low, all data is embedded in the same memory block, first with a HashArray structure, then two arrays for hashes and values. Hasharray can only be used with a search key that is stored in a value.
typedef uint16_t HashType; typedef uint16_t HashSize; struct HashArray { HashSize length; HashSize count; uint16_t value_size; uint16_t hashs[length]; uint8_t values[length * value_size]; }; #define hasharray_get_hashs(array) (HashType *)(((uint8_t *)(array)) + sizeof(HashArray)) #define hasharray_get_values(array) ((uint8_t *)(array)) + sizeof(HashArray) + \ ((array)->length * sizeof(HashType)) #define hasharray_get_value(array, idx) (hasharray_get_values(array) + ((idx) * (array)->value_size))
The hasharray_get_hashs and hasharray_get_values macros are used to get the hashs and values arrays.
I used this to quickly search for complex objects that are already stored in an array. Objects have a "name" field that is used for searching. Names are hashed and inserted into the hasharray with an index of objects. The values stored in hasharray can be indexes / pointers / integer objects (I use only small 16-bit index values).
If you want to pack a hashray until it is fully populated, you will want to use full 32-bit hashes instead of the 16-bit ones defined above. Large 32-bit hashes will facilitate a quick search when the hash is more than 90% full.
A hashhare, as defined above, can contain a maximum of 65535, which is great, since I never use it on anything that has more than a few hundred values. All that requires more, which should just use a regular hash table. But if memory is really a problem, the HashSize type can be changed to 32 bits. In addition, I use power-of-2 lengths to quickly find hash searches. Some people prefer to use the length of the main bucket, but this is only necessary if the hash function has a poor distribution.
#define hasharray_empty_hash 0xFFFF /* hash value to mark empty slots. */ void *hasharray_search(HashArray *array, HashType hash, uint32_t *next) { HashType *hashs = hasharray_get_hashs(array); uint32_t mask = array->length - 1; uint32_t start_idx; uint32_t idx; hash = (hash == hasharray_empty_hash) ? 0 : hash; /* need one hash value to mark empty slots. */ start_hash_idx = (hash & mask); if(*next == 0) { idx = start_idx; /* new search. */ } else { idx = *next & mask; /* continuing search to next slot. */ } /* find hash in hash array. */ do { /* check for hash match. */ if(hashs[idx] == hash) goto found_hash; /* check for end of chain. */ if(hashs[idx] == hasharray_empty_hash) break; idx++; idx &= mask; } while(idx != start_idx); /* maximum tries reached (ie did a linear search of whole array) or end of chain. */ return NULL; found_hash: *next = idx + 1; /* where to continue search at, if this is not the right value. */ return hasharray_get_values(array) + (idx * array->value_size); }
a hash collision will occur, so the code that calls hasharray_search () should compare the search key with the one stored in the value object. If they do not match, then hasharray_search () is called again. Non-unique keys may also exist, as the search can continue until a “NULL” value is returned to find all values matching a single key. The search function uses linear sensing for regular caching.
typedef struct { char *name; char *type; } Field; typedef struct { Field *list; HashArray *lookup; uint32_t field_count; } Fields; extern Fields *fields_new(uint16_t count) { Fields *fields; fields = calloc(1, sizeof(Fields)); fields->list = calloc(count, sizeof(Field)); fields->lookup = hasharray_new(count, sizeof(uint16_t)); fields->field_count = count; } extern Field *fields_lookup_by_name(Fields *fields, const char *name) { HashType hash = str_to_hash(name); Field *field; uint32_t next = 0; uint16_t *rc; uint16_t idx; do { rc = hasharray_search(fields->lookup, hash, &next); if(rc == NULL) break; idx = *rc; assert(idx < fields->field_count); field = &(fields->list[idx]); if(strcmp(name, field->name) == 0) { return field; } } while(1); return NULL; }
The worst search will be distorted before the linear search of the entire array, if it is 99% full and the key does not exist. If the keys are integers, then the linear search should not be bad, you also need to compare only the keys with the same value of the hash function. I try to have hasharrays so that they are only 70-80%, the space spent on empty slots is not so much if the values have only 16-bit values. With this design, you only use 4 bytes per empty slot when using 16-bit hashes and 16-bit index values. An array of objects (Field Structures in the above example) has no empty spaces.
Also, most hash table implementations I've seen do not store computed hashes and require the full key to be compared to resolve conflicts in buckets. Comparing hashes helps a lot since only a small portion of the hash value is used to find the bucket.