Trie or balanced binary search tree for dictionary storage?

Question

Trie or balanced binary search tree for dictionary storage?

I have a simple requirement (possibly hypothetical):

I want to store a dictionary of an English word (n words) and set a word (character length m), the dictionary can say if the word exists in the dictionary or not. What would be a suitable data structure for this?

balanced binary search tree? as is done in C ++ STL associative data structures such as set, map

or

a trie on strings

Complexity analysis: in balanced bst, the time will be (log n) * m (2 line comparison takes O (m) time symbol by symbol)

in trie, if in each node we could branch out O (1) times, we can find using O (m), but the assumption that with each node we can enter O (1) time is invalid. on each node, the maximum possible branches will be 26. If we want O (1) in a node, we will store a short array indexed by the characters in each node. This will inflate the space. After several levels in trie, the branching will decrease, so it’s better to keep a linked list of the next node characters and pointers.

What looks more practical? any other trade-offs?

Thanks,

+7

dictionary algorithm data-structures tree

xyz Jun 08 '11 at 13:13

source share

5 answers

If it is C ++, you should also consider std::tr1::unordered_set . (If you have C ++ 0x, you can use std::unordered_set .)

It just uses a hash table inside that I would bet on - in practice, use any tree structure. It is also trivial to implement, because you have nothing to implement.

+4

Nemo Jun 08 '11 at 16:56

source share

Binary search will be easier to implement, and it will include the largest comparison of dozens of strings. Given that you know the data ahead, you can build a balanced binary tree so that performance is predictable and easy to understand.

With that in mind, I would use a standard binary tree (possibly using set from C ++, since this is usually implemented as a tree).

+2

Jeff foster Jun 08 '11 at 13:20

source share

A simple solution is to store the dict as sorted, \ n-separated words on disk, load them into memory, and perform a binary search. The only non-standard part here is that you need to scan back to the beginning of the word when you perform a binary search.

Here is the code! (It is assumed that the globals wordlist points to the loaded dict and wordlist_end , which points immediately after the end of the loaded dict.

 // Return >0 if word > word at position p. // Return <0 if word < word at position p. // Return 0 if word == word at position p. static int cmp_word_at_index(size_t p, const char *word) { while (p > 0 && wordlist[p - 1] != '\n') { p--; } while (1) { if (wordlist[p] == '\n') { if (*word == '\0') return 0; else return 1; } if (*word == '\0') { return -1; } int char0 = toupper(*word); int char1 = toupper(wordlist[p]); if (char0 != char1) { return (int)char0 - (int)char1; } ++p; ++word; } } // Test if a word is in the dictionary. int is_word(const char* word_to_find) { size_t index_min = 0; size_t index_max = wordlist_end - wordlist; while (index_min < index_max - 1) { size_t index = (index_min + index_max) / 2; int c = cmp_word_at_index(index, word_to_find); if (c == 0) return 1; // Found word. if (c < 0) { index_max = index; } else { index_min = index; } } return 0; }

The huge advantage of this approach is that the dict is stored on a disk with a readable language and that you do not need any fancy code to load it (allocate a block of memory and read () it at a time).

If you want to use trie, you can use a compressed and compressed suffix representation. Here is a link to one of Donald Knuth’s students, Franklin Liang, who wrote about this trick in his dissertation.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.123.7018&rep=rep1&type=pdf

It uses half the storage of the direct text representation of the dict, gives you the trie speed, and you can (for example, the text representation of the dict) store everything on disk and load it at a time.

The trick used is to pack all the trie nodes into one array, interleaving them where possible. Like the new pointer (and the bit-bit of the end of the word) at each location of the array, as in a regular trie, you store a letter for which this node is - this allows you to determine if the node is valid for your state, or if it is overlapping with a node . Read the related document for a more complete and clear explanation, as well as an algorithm for packing trie into this array.

It is not trivial to implement the described suffix-compression and greedy packing algorithm, but it is quite simple.

+2

user97370 Jun 08 '11 at 2:04

source share

The industry standard is to store the dictionary in a hash table and have an amortized search time of O (1). Space is no longer critical in the industry, especially due to the development of distribution computing.

hashtable is how Google implements its autocomplete function. In particular, each word prefix is a key and puts the word as a value in a hash table.

+1

Amm sokun Jun 18 '11 at 5:31

source share

luke · Accepted Answer · 2011-06-08T13:22:57+0000

I would say that I use Trie, or even better use its more economical place for my cousin Directed Acyclic Word Graph (DAWG) .

It has the same runtime characteristics (insert, look up, delete) as Trie, but it overlaps common suffixes, as well as common prefixes, which can be a big saving in space.

Trie or balanced binary search tree for dictionary storage?

More articles: