A space-efficient data structure for storing a list of words?

Question


Is there anything better than a trie for this situation?

Storing a list of ~100k English words
Needs minimal memory
Lookups need to be reasonable, but not lightning fast

I work in Java, so my first attempt was simply to use a Set<String>. However, I'm targeting a mobile device that is already running low on memory. Since many English words share common prefixes, a trie seems like a decent bet to save some memory. Does anyone know of other good options?

EDIT - Additional information: the data structure will be used for two operations

Answering: is some word XYZ in the list?
Generating the neighborhood of words around XYZ that differ from it by one letter

Thanks for the good suggestions.

+6

java data-structures

allclaws Dec 11 '08 at 1:54

6 answers

One structure I have seen used to minimize space in a spelling dictionary encodes each word as:

the number of characters (one byte) it has in common with the previous word; and
the new ending.

So a list of words would encode as follows:

    word            encoded as
    sanctimonious   0,sanctimonious
    sanction        6,on
    sanguine        3,guine
    trivial         0,trivial

You save 7 bytes right there (19%). I suspect the saving would be similar for a dictionary of 20,000 words, simply because adjacent words in a sorted list tend to share long common prefixes.

To speed up lookups, there was a 26-entry table in memory holding the starting offsets of the words beginning with a, b, c, ..., z. The words at those offsets always had 0 as their first byte, since they had no letters in common with the previous word.

This is something like a trie, but without the pointers, which would undoubtedly get expensive if every character in the tree had a 4-byte pointer associated with it.

Mind you, this was from my CP/M days, when memory was much scarcer than it is now.
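
As a rough illustration of the encoding described in this answer, here is a minimal Java sketch. The class and method names (FrontCoding, encode, decode) are made up for the example, and each entry is kept as readable text such as "6,on" rather than as a packed count byte followed by the suffix, so it shows the idea and not the actual memory layout. It also assumes the input list is already sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: stores each word as (shared-prefix length, new ending). */
final class FrontCoding {

    /** Encodes a sorted word list; each entry looks like "6,on". */
    static List<String> encode(List<String> sortedWords) {
        List<String> encoded = new ArrayList<>();
        String previous = "";
        for (String word : sortedWords) {
            int shared = commonPrefixLength(previous, word);
            encoded.add(shared + "," + word.substring(shared));
            previous = word;
        }
        return encoded;
    }

    /** Rebuilds the original words by replaying the entries in order. */
    static List<String> decode(List<String> encoded) {
        List<String> words = new ArrayList<>();
        String previous = "";
        for (String entry : encoded) {
            int comma = entry.indexOf(',');
            int shared = Integer.parseInt(entry.substring(0, comma));
            String word = previous.substring(0, shared) + entry.substring(comma + 1);
            words.add(word);
            previous = word;
        }
        return words;
    }

    /** Number of leading characters that a and b have in common. */
    private static int commonPrefixLength(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("sanctimonious", "sanction", "sanguine", "trivial");
        List<String> encoded = encode(words);
        System.out.println(encoded);                       // the four entries from the example above
        System.out.println(decode(encoded).equals(words)); // round trip restores the list
    }
}
```

Because every entry depends on the one before it, a lookup has to decode sequentially from a known starting point, which is what the 26-entry offset table provides in the scheme described above.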

+8

paxdiablo Dec 11 '08 at 2:26

A Patricia trie might be more appropriate:

http://en.wikipedia.org/wiki/Patricia_tree

My (fuzzy) memory tells me that they were used in some early full-text search engines ...

Paul.

+6

Paul W Homer Dec 11 '08 at 3:35

You would still need to maintain the tree structure with a trie. By Huffman-encoding the alphabet, or N-letter sequences (for common forms such as "tion", "un", "ing"), you can take advantage of the frequency of occurrence in your dictionary and compress each entry down to bits.

+1

Eugene Yokota Dec 11 '08 at 2:05

Totally wild idea... (i.e. most likely very wrong)

How about storing the words as a tree of all possible letter combinations?

Then each "word" costs only a single char and two pointers (one to the char and one to a terminator). That way, the more letters words have in common, the lower the cost per word.

    c-a-r-.
        |-p-.
        | `-s-.
        |-s-.
        `-t-.
          `-s-.

car carp carps cars cart carts

So, for 9 characters and 14 pointers, we get 6 "words" totaling 25 letters.

Searches would be quick (pointer lookups instead of char comparisons), and you could do some optimizations to save even more space...?

EDIT: Looks like I reinvented the wheel. ;-)
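
For what it's worth, the tree described here is a trie, and a minimal Java sketch of one might look like the following. The names (WordTrie, neighbors, and so on) are invented for the example, and it uses a TreeMap per node for clarity, which is not a memory-efficient node layout. It covers the two operations from the question: a membership test and listing the stored words that differ from a given word in exactly one position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: a plain trie; not tuned for memory. */
final class WordTrie {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean terminal; // plays the role of the "." terminator in the diagram
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.terminal = true;
    }

    boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.terminal;
    }

    /** Stored words of the same length that differ from word in exactly one position. */
    List<String> neighbors(String word) {
        List<String> result = new ArrayList<>();
        collect(root, word, 0, false, new StringBuilder(), result);
        return result;
    }

    private void collect(Node node, String word, int pos, boolean changed,
                         StringBuilder path, List<String> out) {
        if (pos == word.length()) {
            if (changed && node.terminal) {
                out.add(path.toString());
            }
            return;
        }
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            char c = entry.getKey();
            boolean same = (c == word.charAt(pos));
            if (!same && changed) {
                continue; // the single allowed substitution is already used
            }
            path.append(c);
            collect(entry.getValue(), word, pos + 1, changed || !same, path, out);
            path.setLength(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        WordTrie trie = new WordTrie();
        for (String w : new String[] {"car", "carp", "carps", "cars", "cart", "carts"}) {
            trie.add(w);
        }
        System.out.println(trie.contains("cart"));  // a stored word
        System.out.println(trie.neighbors("cars")); // words one substitution away
    }
}
```

The neighbor search walks the trie once and only follows branches that still have a substitution to spare, so it never has to try all 25 alternative letters at every position.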

+1

Chris Nava Dec 11 '08 at 4:17

Related to Paul's post:

Is there any reason you can't consider a trie in your case? If it's just an implementation issue, here is a tight implementation of Patricia trie insert and search in C (from NIST):

Patricia Insert in C

Patricia Search in C

+1

Rich Dec 12 '08 at 14:58

Mike Scott · Accepted Answer · 2008-12-11T02:05:58+0000

What are you doing? If it's spell checking, you could use a Bloom filter - see this code kata.
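
To make the Bloom filter idea concrete, here is a minimal Java sketch. It is not taken from the linked kata; the class name WordBloomFilter, the choice of two base hashes (String.hashCode and a 32-bit FNV-1a) and the sizes in main are all assumptions made for illustration. A Bloom filter never reports a stored word as missing, but it can occasionally report a word that was never added as present.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Sketch only: a fixed-size Bloom filter for words. */
final class WordBloomFilter {

    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions set per word

    WordBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    void add(String word) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(word, i));
        }
    }

    /** False means definitely absent; true means probably present. */
    boolean mightContain(String word) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(word, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: derive the i-th bit position from two base hashes.
    private int index(String word, int i) {
        int h1 = word.hashCode();
        int h2 = fnv1a(word);
        return Math.floorMod(h1 + i * h2, size);
    }

    // 32-bit FNV-1a over the UTF-8 bytes of the word.
    private static int fnv1a(String word) {
        int hash = 0x811C9DC5;
        for (byte b : word.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }

    public static void main(String[] args) {
        // Sizes here are arbitrary; a real word list would need them tuned.
        WordBloomFilter filter = new WordBloomFilter(1 << 16, 4);
        for (String w : new String[] {"car", "carp", "carps", "cars", "cart", "carts"}) {
            filter.add(w);
        }
        System.out.println(filter.mightContain("cart")); // added words are always reported present
    }
}
```

The filter only answers membership queries. For the neighborhood operation in the question, one option would be to generate every one-letter substitution of XYZ and test each candidate against the filter, accepting that a few false positives may slip through.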
