What stands out in both of your test runs is that the vast majority of the words are spelled correctly. That means you should focus on optimizing the lookups for words that are in the dictionary.
In your first test, for example, only 1.5% of the words are misspelled. Suppose a lookup for a word that is not in the dictionary takes, on average, twice as long as a successful one (because the whole bucket chain has to be scanned). Even if you cut that cost to zero (the theoretical minimum :)), you would speed your program up by less than 3%: the average work per word is 0.985·1 + 0.015·2 = 1.015 units, and removing the 0.030 units spent on failures saves only 0.030/1.015 ≈ 2.96%.
A general hash-table optimization is to move the key you just found to the front of its bucket chain, if it is not already there. This tends to reduce the number of entries checked for common words. It is not a huge speedup, but in workloads where some keys are looked up far more often than others it is definitely noticeable.
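A minimal sketch of that move-to-front heuristic, assuming a classic chained table whose nodes look roughly like the `node` struct below (the names are mine, not from your code):

```c
#include <stdbool.h>
#include <string.h>

typedef struct node {
    struct node *next;
    char word[];                 /* NUL-terminated dictionary word */
} node;

/* Look up `word` in the bucket starting at *head; on a hit, splice the
   node out and relink it at the front so frequently queried words are
   found on the first probe next time. */
bool lookup_mtf(node **head, const char *word)
{
    node *prev = NULL;
    for (node *cur = *head; cur != NULL; prev = cur, cur = cur->next) {
        if (strcmp(cur->word, word) == 0) {
            if (prev != NULL) {          /* not already at the front */
                prev->next = cur->next;
                cur->next  = *head;
                *head      = cur;
            }
            return true;
        }
    }
    return false;
}
```

Note that this mutates the table on reads, so it trades away the "the table never changes after loading" property mentioned further down.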
Reducing chain length by lowering the table's load factor can help more, at the cost of extra memory.
Another possibility, since you are not going to modify the dictionary after it is built, is to store each bucket chain in contiguous memory, with no pointers. That not only reduces memory consumption but also improves cache behaviour: because most words are short, an entire bucket will usually sit in a single cache line.
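For illustration, one way such a flattened bucket could look, assuming each bucket is baked at load time into a single run of NUL-terminated words followed by an extra NUL (this layout and the function name are my own invention):

```c
#include <stdbool.h>
#include <string.h>

/* A flattened bucket such as "cat\0cart\0cataract\0\0": the chain's words
   copied back-to-back, with an empty string marking the end. Lookup is one
   linear scan over contiguous bytes, with no pointer chasing. */
bool bucket_contains(const char *bucket, const char *word)
{
    for (const char *p = bucket; *p != '\0'; p += strlen(p) + 1)
        if (strcmp(p, word) == 0)
            return true;
    return false;
}
```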
And since the words are usually quite short, it may be worth optimizing the comparison itself. strcmp() is well optimized, but it is generally tuned for long strings. If you are allowed to use it, the SSE4.2 PCMPESTRI instruction is astonishingly powerful (though figuring out what it does, and how to apply it to your problem, can be a real time sink). More simply, you should be able to compare four eight-byte prefixes at once with 256-bit compare operations (and you may even have 512-bit operations available), so with a clever data layout you could compare an entire bucket in parallel.
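A sketch of the 256-bit idea with AVX2 intrinsics, assuming each bucket stores up to four 8-byte, NUL-padded word prefixes contiguously (compile with -mavx2; the layout and names are hypothetical):

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Compare the query's 8-byte prefix against four stored prefixes at once.
   Returns a 4-bit mask: bit i set => stored prefix i matches. */
static inline unsigned match_prefixes(const char *query,
                                      const uint64_t prefixes[4])
{
    uint64_t q = 0;
    strncpy((char *)&q, query, 8);   /* 8-byte, NUL-padded query prefix */

    __m256i needle = _mm256_set1_epi64x((long long)q);        /* broadcast query   */
    __m256i stored = _mm256_loadu_si256((const __m256i *)prefixes);
    __m256i eq     = _mm256_cmpeq_epi64(needle, stored);       /* 4 lanes compared  */

    /* One bit per 64-bit lane, taken from the lane sign bits. */
    return (unsigned)_mm256_movemask_pd(_mm256_castsi256_pd(eq));
}
```

A set bit only tells you the first eight bytes match, so words longer than that still need a full comparison (or a stored length) to confirm the hit.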
That is not to say that hash tables are necessarily the optimal data structure for this problem. But remember that the more you can do within a single cache line, the faster your program will run. Pointer-heavy linked structures can be suboptimal even when they look good on paper.
After thinking about this problem for a couple of days, and actually writing some code, I came to the conclusion that optimizing for successful-lookup speed is probably not the right goal for real-world spell checking. It is true that most of the words in the text you are checking will be spelled correctly (though that depends on the user), but an algorithm that tries to suggest the correct spelling is likely to perform a great many unsuccessful lookups as it cycles through candidate misspellings. I know that is probably outside the scope of this problem, but it matters for optimization, because you end up with two quite different strategies.
If you are trying to reject quickly, you want lots of empty bucket chains, or a Bloom filter (or its moral equivalent), so that you can reject most misspelled words on the first probe.
For example, if you have a good hash function that produces more bits than you need for the table index - and you almost certainly do, because spelling dictionaries are not that big - then you can simply use some otherwise-unused bits of the hash as a secondary hash. Without even going to the trouble of implementing a full Bloom filter, you can just add, say, a 32-bit mask to each bucket header recording which values of five secondary hash bits occur among the words stored in that bucket. Combined with a sparse table - I used 30% occupancy for the experiment, which is not even that sparse - you should be able to reject 80-90% of failed lookups without looking past the bucket header.
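A sketch of that per-bucket mask, with hypothetical names throughout; FNV-1a stands in for whatever 64-bit hash you actually use, and the table builder is assumed to set `mask |= 1u << sec` for every word it stores in a bucket:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t mask;       /* bit i set <=> some stored word has secondary hash i */
    const char *words;   /* contiguous NUL-terminated words, double-NUL ended   */
} bucket;

/* FNV-1a, only as a stand-in for whatever 64-bit hash you already have. */
static uint64_t hash64(const char *s)
{
    uint64_t h = 14695981039346656037ULL;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

bool lookup(const bucket *table, size_t nbuckets, const char *word)
{
    uint64_t h      = hash64(word);
    const bucket *b = &table[h % nbuckets];
    unsigned sec    = (unsigned)(h >> 59);        /* 5 high, otherwise unused bits */

    if ((b->mask & (1u << sec)) == 0)             /* quick reject: nothing stored  */
        return false;                             /* here has this secondary hash  */

    for (const char *p = b->words; *p != '\0'; p += strlen(p) + 1)
        if (strcmp(p, word) == 0)
            return true;
    return false;
}
```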
On the other hand, if you are optimizing for successful lookups, it may turn out that fairly large buckets are better, because they reduce the number of bucket headers and improve cache utilization. As long as an entire bucket fits in a cache line, comparing several candidates is so fast that you will not notice the difference. (And since words tend to be short, it is reasonable to expect five or six of them to fit in 64 bytes.)
In any case, without putting too much work into it, I was able to do a million lookups in about 70 milliseconds of CPU time. Multithreading can reduce the elapsed time quite a bit, particularly since no locking is needed: the hash table is immutable once built.
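For instance, with POSIX threads you can hand each thread a slice of the query words and let them all read the table concurrently with no locks at all; this is only an illustration, and `lookup_word` stands in for whatever lookup routine you end up with:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

extern bool lookup_word(const char *word);   /* read-only table => thread-safe */

typedef struct {
    const char *const *queries;   /* this thread's slice of the word list */
    size_t count;
    size_t hits;                  /* per-thread result, nothing shared    */
} job;

static void *worker(void *arg)
{
    job *j = arg;
    for (size_t i = 0; i < j->count; i++)
        j->hits += lookup_word(j->queries[i]);
    return NULL;
}

size_t count_hits(const char *const *queries, size_t n, int nthreads)
{
    pthread_t tid[nthreads];
    job jobs[nthreads];
    size_t chunk = (n + (size_t)nthreads - 1) / (size_t)nthreads;

    for (int t = 0; t < nthreads; t++) {
        size_t start = (size_t)t * chunk;
        if (start > n) start = n;
        size_t cnt = n - start < chunk ? n - start : chunk;
        jobs[t] = (job){ queries + start, cnt, 0 };
        pthread_create(&tid[t], NULL, worker, &jobs[t]);
    }

    size_t total = 0;
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);
        total += jobs[t].hits;
    }
    return total;
}
```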
The moral I want to draw from this:
To optimize:
you need to understand your data
you need to understand your intended use pattern
you need to adapt your algorithms based on the above
you need to experiment a lot.