A simple solution is to store the dict as sorted, \ n-separated words on disk, load them into memory, and perform a binary search. The only non-standard part here is that you need to scan back to the beginning of the word when you perform a binary search.
Here is the code! (It is assumed that the globals wordlist points to the loaded dict and wordlist_end , which points immediately after the end of the loaded dict.
// Return >0 if word > word at position p. // Return <0 if word < word at position p. // Return 0 if word == word at position p. static int cmp_word_at_index(size_t p, const char *word) { while (p > 0 && wordlist[p - 1] != '\n') { p--; } while (1) { if (wordlist[p] == '\n') { if (*word == '\0') return 0; else return 1; } if (*word == '\0') { return -1; } int char0 = toupper(*word); int char1 = toupper(wordlist[p]); if (char0 != char1) { return (int)char0 - (int)char1; } ++p; ++word; } } // Test if a word is in the dictionary. int is_word(const char* word_to_find) { size_t index_min = 0; size_t index_max = wordlist_end - wordlist; while (index_min < index_max - 1) { size_t index = (index_min + index_max) / 2; int c = cmp_word_at_index(index, word_to_find); if (c == 0) return 1; // Found word. if (c < 0) { index_max = index; } else { index_min = index; } } return 0; }
The huge advantage of this approach is that the dict is stored on a disk with a readable language and that you do not need any fancy code to load it (allocate a block of memory and read () it at a time).
If you want to use trie, you can use a compressed and compressed suffix representation. Here is a link to one of Donald Knuthβs students, Franklin Liang, who wrote about this trick in his dissertation.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.123.7018&rep=rep1&type=pdf
It uses half the storage of the direct text representation of the dict, gives you the trie speed, and you can (for example, the text representation of the dict) store everything on disk and load it at a time.
The trick used is to pack all the trie nodes into one array, interleaving them where possible. Like the new pointer (and the bit-bit of the end of the word) at each location of the array, as in a regular trie, you store a letter for which this node is - this allows you to determine if the node is valid for your state, or if it is overlapping with a node . Read the related document for a more complete and clear explanation, as well as an algorithm for packing trie into this array.
It is not trivial to implement the described suffix-compression and greedy packing algorithm, but it is quite simple.
user97370
source share