A space-efficient data structure for storing a list of words?

Is there anything better than a trie for this situation?

  • Storing a list of ~100k English words
  • Minimal memory footprint
  • Lookups should be reasonably fast, but need not be lightning fast

I work in Java, so my first attempt was simply to use a Set<String>. However, I am targeting a mobile device and am already running low on memory. Since many English words share common prefixes, a trie seems like a decent bet for saving some memory. Does anyone know of other good options?

EDIT - Additional information: the data structure will be used for two operations

  • Answering: is some word XYZ in the list?
  • Generating the neighborhood of words around XYZ that differ from it by one letter

Thanks for the good suggestions.

+6
java data-structures
6 answers

What are you doing? If it's spell checking, you could use a Bloom filter - see this code kata.
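To make the Bloom filter suggestion concrete, here is a minimal Java sketch. The class name `WordBloomFilter`, the double-hashing scheme and the sizing are illustrative assumptions, not taken from the linked kata. A Bloom filter never reports a stored word as missing, but it can report an absent word as present, so it suits the "is XYZ in the list?" check only if occasional false positives are acceptable.

```java
import java.util.BitSet;

// Hypothetical sketch of a Bloom filter for word membership tests.
class WordBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions probed per word

    WordBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // FNV-1a style hash over the chars, used as the second hash function.
    private static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 16777619;
        }
        return h;
    }

    // Double hashing: the i-th probe position is h1 + i * h2 (mod size).
    private int index(String word, int i) {
        int h1 = word.hashCode();
        int h2 = fnv1a(word) | 1; // force odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String word) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(word, i));
        }
    }

    // false means "definitely not stored"; true means "probably stored".
    boolean mightContain(String word) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(word, i))) {
                return false;
            }
        }
        return true;
    }
}
```

As a rough sizing guide, about 10 bits per word with 7 probes gives a false-positive rate of around 1%, so 100k words would need on the order of 125 KB. The filter cannot list its contents, so the one-letter neighborhood would have to be built by generating candidate words and testing each one.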

+3

One structure I saw for minimizing space in a spelling dictionary was to encode each word as:

  • the number of characters (a byte) in common with the previous word; and
  • the new ending.

So a list of words

  HERE             would encode as    THIS
  sanctimonious                       0,sanctimonious
  sanction                            6,on
  sanguine                            3,guine
  trivial                             0,trivial

You're saving 7 bytes right there (19%). I suspect the saving would be similar for a 20,000-word dictionary, simply because adjacent words in a sorted list tend to share long common prefixes.

To speed up lookup, there was a 26-entry table in memory holding the starting offsets for words beginning with a, b, c, ..., z. The words at these offsets always had 0 as their first byte, since they have no letters in common with the previous word.

This is something like a trie, but without the pointers, which would surely get expensive in space if every character in the tree had a 4-byte pointer associated with it.

Mind you, this was from my CP/M days, when memory was much scarcer than it is now.
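The scheme described in this answer can be sketched in Java roughly as follows. The class name `FrontCodedWords` and the exact byte layout are assumptions: the answer does not say how the end of each ending was marked, so this sketch stores an explicit suffix-length byte, which costs one extra byte per word. It also assumes the input is sorted and consists of non-empty lowercase a-z words shorter than 128 letters.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a prefix-sharing (front-coded) word list.
class FrontCodedWords {
    // Entries back to back: [shared prefix length][suffix length][suffix bytes]
    private final byte[] data;
    // Offset of the first entry for each initial letter a..z, or -1 if none.
    private final int[] firstOffset = new int[26];

    FrontCodedWords(List<String> sortedWords) {
        Arrays.fill(firstOffset, -1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String prev = "";
        for (String word : sortedWords) {
            int shared = 0;
            if (prev.isEmpty() || prev.charAt(0) != word.charAt(0)) {
                // First word for this letter: record its offset, share nothing.
                firstOffset[word.charAt(0) - 'a'] = out.size();
            } else {
                int max = Math.min(prev.length(), word.length());
                while (shared < max && prev.charAt(shared) == word.charAt(shared)) {
                    shared++;
                }
            }
            byte[] suffix = word.substring(shared).getBytes(StandardCharsets.US_ASCII);
            out.write(shared);
            out.write(suffix.length);
            out.write(suffix, 0, suffix.length);
            prev = word;
        }
        data = out.toByteArray();
    }

    boolean contains(String word) {
        if (word.isEmpty()) return false;
        int letter = word.charAt(0) - 'a';
        if (letter < 0 || letter > 25 || firstOffset[letter] < 0) return false;
        int pos = firstOffset[letter];
        StringBuilder current = new StringBuilder();
        while (pos < data.length) {
            int shared = data[pos++];
            int suffixLength = data[pos++];
            current.setLength(shared); // keep the prefix shared with the previous word
            for (int i = 0; i < suffixLength; i++) {
                current.append((char) data[pos++]);
            }
            int cmp = current.toString().compareTo(word);
            if (cmp == 0) return true;
            if (cmp > 0) return false; // list is sorted, so we have passed the word
        }
        return false;
    }

    int encodedSize() {
        return data.length;
    }
}
```

Lookup is a linear scan within one initial letter, which fits the "reasonable, but not lightning fast" requirement. A finer-grained offset table (for example, per two-letter prefix) would shorten the scans at the cost of a larger index.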

+8

A Patricia trie might be more appropriate:

http://en.wikipedia.org/wiki/Patricia_tree

My (fuzzy) memory tells me that they were used in some early full-text search engines ...

Paul.

+6

With a trie you still have to maintain the tree structure itself. Huffman-encoding the alphabet, or N-letter groups (for common forms such as "tion", "un", "ing"), can take advantage of the frequency of occurrence in your dictionary and compress each entry down to bits.
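As a rough illustration of the Huffman idea, the following Java sketch builds a code over single letters from their frequencies in the word list and counts the bits needed. The class name `LetterHuffman` and its methods are hypothetical. It covers single letters only, not N-letter groups, and it omits the end-of-word symbol that a real encoding would need to separate entries.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: Huffman codes for the letters of a word list.
class LetterHuffman {
    private static final class Node {
        final long weight;
        final char symbol; // meaningful only for leaves
        final Node left, right;

        Node(long weight, char symbol, Node left, Node right) {
            this.weight = weight;
            this.symbol = symbol;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null;
        }
    }

    // Returns a map from letter to its bit string, e.g. 'c' -> "00".
    static Map<Character, String> buildCodes(List<String> words) {
        Map<Character, Long> freq = new HashMap<>();
        for (String w : words) {
            for (char c : w.toCharArray()) {
                freq.merge(c, 1L, Long::sum);
            }
        }
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Long.compare(a.weight, b.weight));
        for (Map.Entry<Character, Long> e : freq.entrySet()) {
            queue.add(new Node(e.getValue(), e.getKey(), null, null));
        }
        // Repeatedly merge the two lightest subtrees.
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(a.weight + b.weight, '\0', a, b));
        }
        Map<Character, String> codes = new HashMap<>();
        if (!queue.isEmpty()) {
            assign(queue.poll(), "", codes);
        }
        return codes;
    }

    private static void assign(Node n, String prefix, Map<Character, String> codes) {
        if (n.isLeaf()) {
            // A single-letter alphabet still needs one bit per symbol.
            codes.put(n.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        assign(n.left, prefix + "0", codes);
        assign(n.right, prefix + "1", codes);
    }

    // Total bits needed to encode every letter of every word.
    static long encodedBits(List<String> words, Map<Character, String> codes) {
        long bits = 0;
        for (String w : words) {
            for (char c : w.toCharArray()) {
                bits += codes.get(c).length();
            }
        }
        return bits;
    }
}
```

For the six words car, carp, carps, cars, cart and carts (25 letters), this gives 61 bits, compared with 200 bits at one byte per letter. Real savings depend on the letter distribution of the actual dictionary and on how word boundaries are stored.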

+1

Totally wild idea... (that is, most likely very wrong)

How about storing the words as a tree of all possible letter combinations?

Then each "word" costs only a single char and two pointers (one to the char and one to a terminator). This way, the more letters the words have in common, the lower the cost per word.

        . .
       / /
      r-p-s-.
     /\\
    a  \s-.
   /    t-.
  c      \
          s-.

  car
  carp
  carps
  cars
  cart
  carts

So, for 9 characters and 14 pointers, we get 6 "words" totalling 25 letters.

Searches would be quick (following pointers instead of comparing chars), and you could probably do some optimizations to save even more space...?

EDIT: Looks like I reinvented the wheel. ;-)
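The tree described in this answer is essentially a trie. A minimal Java sketch of one, covering both operations from the question, might look like the following. The class name `WordTrie` and its methods are hypothetical, and using a `TreeMap` per node keeps the logic short but is far from memory-optimal; a space-tuned version would pack the children into arrays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a trie with membership and one-letter-neighbor queries.
class WordTrie {
    private static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        boolean terminal; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    void add(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.terminal = true;
    }

    boolean contains(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return false;
        }
        return n.terminal;
    }

    // All stored words of the same length that differ from `word`
    // in exactly one position, in alphabetical order.
    List<String> neighbors(String word) {
        List<String> out = new ArrayList<>();
        collect(root, word, 0, false, new StringBuilder(), out);
        return out;
    }

    private void collect(Node n, String word, int i, boolean changed,
                         StringBuilder path, List<String> out) {
        if (i == word.length()) {
            if (changed && n.terminal) out.add(path.toString());
            return;
        }
        for (Map.Entry<Character, Node> e : n.children.entrySet()) {
            char c = e.getKey();
            boolean same = c == word.charAt(i);
            if (!same && changed) continue; // only one substitution allowed
            path.append(c);
            collect(e.getValue(), word, i + 1, changed || !same, path, out);
            path.setLength(path.length() - 1);
        }
    }
}
```

With the six words from the diagram stored, `neighbors("carp")` walks only branches that exist in the trie, so the cost depends on the stored words rather than on trying all 25 alternatives at every position.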

+1

Regarding Paul's post:

Is there any reason you can't consider a trie in your case? If it's just an implementation issue, here is a tight implementation of Patricia trie insert and search in C (from NIST):

Patricia insert in C

Patricia search in C

+1