Best data structure for vocabulary implementation?

What would be the best data structure for storing all dictionary words? The best I could think of is to use a HashMap that will display on a HashTable . Basically, depending on the first character, we get the associated HashTable , and then using this, we can add words starting with that character. Then we will choose a good hash function based on the string.

Is there a better approach?

+56
string dictionary algorithm data-structures
Apr 04 '12 at 19:18
source share
1 answer

Depending on what you want to do, there are many good data structures.

If you just want to keep the words and ask β€œis this word here or not?”, A standard hash table without any other fancy mechanisms is a smart approach. If this word is specified in advance, consider using an ideal hash table to get excellent performance and space usage.

If you want to check if a given prefix exists with quick search support, trie is a good option, although it can be a bit inefficient. It also supports quick insertion or deletion. It also allows iteration in alphabetical order, which hashing does not offer. This is essentially the structure you described in your answer, but depending on the use case, other representations of attempts may be better.

If, in addition to the above, you know that the word list is correct, consider using DAWG (directional acyclic text graph), which is essentially a DFA with minimal state for the language. It is significantly more compact than trie, but supports many of the same operations.

If you want trie behavior, but don’t want to pay a huge amount of fines, then the base tree . These are very different structures, but they can be much better than three in different circumstances.

If space is a concern, but you want to get a trie, look at a compressed trie representation, which has a slower search, but about the theoretically optimal space. The link discusses how it is used in JavaScript as an easy way to transfer a huge amount of data. An alternative compact representation is the double-array trie , although admittedly I know very little about it.

If you want to use the dictionary for operations such as spell checking, where you need to find words similar to other words, BK-tree is a great data structure to consider.

Hope this helps!

+126
Apr 04 '12 at 19:21
source share



All Articles