Good algorithm and data structure for finding words with missing letters?

I need to write an efficient algorithm for looking up words in a dictionary when some letters are unknown, and I want the set of possible matching words.

For example, given th??e, I can return these, those, theme, there, etc.

I was wondering if anyone could suggest some data structures or an algorithm that I should use.

Thanks!

EDIT: A trie is too space-inefficient, and it would make lookups too slow. Any other ideas or modifications?

UPDATE: There will be up to two question marks, and when two question marks appear, they will be adjacent (consecutive).

I currently use 3 hash tables: one for exact matches, one for patterns with 1 question mark, and one for patterns with 2 question marks. Given the dictionary, I hash all the possible patterns. For example, for the word WORD I hash WORD, ?ORD, W?RD, WO?D, WOR?, ??RD, W??D, WO?? and add them to the dictionary. Then I use a linked list to chain collisions together. So say hash(W?RD) = hash(STR?NG) = 17. hashtab(17) points to WORD, and WORD points to STRING, because it is a linked list.

The average lookup time for one word is about 2e-6 s. I am trying to do better, preferably on the order of 1e-9 s.

EDIT: I haven't looked into it further yet, but inserting 3M entries takes 0.5 seconds and looking up 3M entries takes 4 seconds.

Thanks!

+56
algorithm data-structures
Dec 23 '09 at 14:24
20 answers

I believe that in this case the best approach is to use a flat file where each word is on its own line. That way you can conveniently use the power of regular expression search, which is highly optimized and will probably beat any data structure you could devise for this problem yourself.

Solution # 1: Using Regex

This is working Ruby code for this problem:

 def query(str, data)
   r = Regexp.new("^#{str.gsub("?", ".")}$")
   idx = 0
   begin
     idx = data.index(r, idx)
     if idx
       yield data[idx, str.size]
       idx += str.size + 1
     end
   end while idx
 end

 start_time = Time.now
 query("?r?te", File.read("wordlist.txt")) do |w|
   puts w
 end
 puts Time.now - start_time

The wordlist.txt file contains 45,425 words (it can be downloaded here). The program's output for the query ?r?te is:

 brute
 crate
 Crete
 grate
 irate
 prate
 write
 wrote
 0.013689

So it takes only about 14 milliseconds to read the whole file and find all matches in it. And it scales very well for all kinds of query patterns, even ones where a trie is very slow:

Query ????????????????e:

 counterproductive
 indistinguishable
 microarchitecture
 microprogrammable
 0.018681

Query ?h?a?r?c?l?:

 theatricals
 0.013608

It looks fast enough for me.

Solution #2: Regex with prepared data

If you want to go even faster, you can split the word list into separate strings, one per word length, and simply pick the right one based on the length of the query. Replace the last 5 lines with this code:

 def query_split(str, data)
   query(str, data[str.length]) do |w|
     yield w
   end
 end

 # prepare data
 data = Hash.new("")
 File.read("wordlist.txt").each_line do |w|
   data[w.length - 1] += w
 end

 # use prepared data for query
 start_time = Time.now
 query_split("?r?te", data) do |w|
   puts w
 end
 puts Time.now - start_time

Building the data structure takes about 0.4 seconds, but all queries become about 10 times faster (the exact time depends on the number of words of that length):

  • ?r?te 0.001112 sec
  • ?h?a?r?c?l? 0.000852 sec
  • ????????????????e 0.000169 sec

Solution #3: One big hashtable (updated requirements)

Since you have changed your requirements, you can easily expand your idea to use just one big hashtable that contains all precomputed results. But instead of handling collisions yourself, you can rely on the performance of a properly implemented hash table.

Here I create one big hashtable where each possible query maps to the list of its results:

 def create_big_hash(data)
   h = Hash.new do |h, k|
     h[k] = Array.new
   end
   data.each_line do |l|
     w = l.strip
     # add all words with one ?
     w.length.times do |i|
       q = String.new(w)
       q[i] = "?"
       h[q].push w
     end
     # add all words with two ??
     (w.length - 1).times do |i|
       q = String.new(w)
       q[i, 2] = "??"
       h[q].push w
     end
   end
   h
 end

 # prepare data
 t = Time.new
 h = create_big_hash(File.read("wordlist.txt"))
 puts "#{Time.new - t} sec preparing data\n#{h.size} entries in big hash"

 # use prepared data for query
 t = Time.new
 h["?ood"].each do |w|
   puts w
 end
 puts (Time.new - t)

Output:

 4.960255 sec preparing data
 616745 entries in big hash
 food
 good
 hood
 mood
 wood
 2.0e-05

Query performance is O(1); it is just a hashtable lookup. The time 2.0e-05 is probably below the timer's accuracy. Running the query 1000 times, I get an average of 1.958e-6 seconds per query. To get even faster, I switched to C++ and used Google Sparse Hash, which is extremely memory-efficient and fast.

Solution # 4: get really serious

All of the above solutions work and should be good enough for many use cases. If you really want to get serious and have a lot of spare time on your hands, read these good papers:

+66
Dec 23 '09 at 20:29

Given the current limitations:

  • There will be up to two question marks
  • When there are two question marks, they are adjacent
  • There are ~100,000 words in the dictionary, and the average word length is 6.

I have two viable solutions for you:

Memory-hungry, but very quick solution: HASH

You can use a hash whose keys are your words with up to two '?', and whose values are a list of the matching words. This hash will have about 100,000 + 100,000*6 + 100,000*5 = 1,200,000 entries (if you have 2 question marks, you only need to find the position of the first one...). Each entry stores either a list of words or a list of pointers to the existing words. If you store a list of pointers, and we assume that on average fewer than 20 words match each word with two '?', then the additional memory is fewer than 20 * 1,200,000 = 24,000,000 pointers.

If each pointer is 4 bytes, the memory requirement is (24,000,000 + 1,200,000) * 4 bytes = 100,800,000 bytes ~= 96 megabytes.

To summarize this solution:

  • Memory consumption: ~96 MB
  • Time per lookup: computing a hash function and following a pointer. O(1)

Note: if you want to use a smaller hash, you can, but then it is better to store a balanced search tree in each entry instead of a linked list, for better performance.

Space-saving, but still a very quick solution: TRIE variation

This solution uses the following observation:

If the '?' signs were always at the end of the word, a trie would be an excellent solution.

A search in the trie would walk down the known letters, and for the last two letters a DFS traversal would yield all the endings. A very fast and very elegant solution.

So, let's use this observation to build something to work that way.

You can think of every word in your dictionary as a word ending with @ (or any other character that does not appear in your dictionary). So the word "space" becomes "space@". Now, if you rotate each of the words with the @ sign, you get the following:

 space@, pace@s, ace@sp, *ce@spa*, e@spac 

(no rotation with @ as the first letter).

If you insert all of these variations into a TRIE, you can easily find any word you are looking for by "rotating" your query word.

Example: you want to find all words that match s??ce (one of them is space, another is slice). You build the word s??ce@ and rotate it so that the ? signs are at the end, i.e. ce@s??.

All rotation variations exist inside the trie, in particular ce@spa (marked * above). Once you reach the end of the known prefix, you walk all the continuations of the appropriate length and save them. Then you rotate them again so that @ is the last letter, and voila - you have all the words you were looking for!
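As an illustration, here is a small Python sketch of this rotation trick (a dict-of-dicts trie with '@' as the terminator; the function names and the toy word list are mine, not from the answer):

```python
def insert_word(trie, word):
    # Insert every rotation of word + '@' (except the one that would
    # start with '@') into a dict-of-dicts trie.
    w = word + "@"
    for i in range(len(w)):
        rot = w[i:] + w[:i]
        if rot.startswith("@"):
            continue
        node = trie
        for ch in rot:
            node = node.setdefault(ch, {})
        node["$"] = rot  # remember the full rotation at its leaf

def query(trie, pattern):
    # Rotate pattern + '@' so the (adjacent) '?'s come last, walk the
    # known prefix, then DFS exactly as many levels as there are '?'s.
    p = pattern + "@"
    i = p.index("?")            # first wildcard (assumes at least one)
    r = p.rindex("?") + 1       # one past the last wildcard
    prefix = p[r:] + p[:i]      # e.g. s??ce@ -> ce@s
    k = r - i                   # number of wildcards to fill in
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    results = []
    def dfs(node, depth):
        if depth == k:
            if "$" in node:
                rot = node["$"]
                j = rot.index("@")           # un-rotate the match
                results.append(rot[j + 1:] + rot[:j])
            return
        for ch, child in node.items():
            if ch != "$":
                dfs(child, depth + 1)
    dfs(node, 0)
    return sorted(results)

trie = {}
for w in ["space", "slice", "spice", "brick"]:
    insert_word(trie, w)
print(query(trie, "s??ce"))  # ['slice', 'space', 'spice']
```

Note that the prefix walk plus the fixed-depth DFS together consume exactly len(word)+1 characters, so only words of the right length can match.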

To summarize this solution:

  • Memory consumption: for each word, all of its rotations appear in the trie, so on average 6x the raw data is stored. The size of a trie is about 3x (just guessing...) the raw space stored inside it. So the total space needed for this trie is 6 * 3 * 100,000 = 1,800,000 words ~= 6.8 megabytes.

  • Time for each search:

    • rotating the word: O(word length)
    • prefix search in the trie: O(word length)
    • iterating over all endings: O(number of matches)
    • rotating the endings back: O(total length of the answers)

    To summarize, it is very fast, and depends only on the word length * a small constant.

Summarizing...

The second option has great time and space complexity and would be the best choice for you. But there are a couple of problems with the second solution (in which case you might use the first one):

  • It is more difficult to implement. I am not sure whether there are programming languages with a built-in trie. If not, it means you will need to implement it yourself...
  • It does not scale well. If tomorrow you decide that you need the question marks spread throughout the word, and not necessarily adjacent, you will have to think hard about how to fit that into the second solution. In the case of the first solution, it is quite easy to generalize.
+24
Dec 30 '09 at 18:50

To me this problem sounds like a good fit for a trie data structure. Enter the whole dictionary into your trie, then look up the word. For a missing letter you would have to try all sub-tries, which should be relatively easy to do with a recursive approach.

EDIT: I wrote a simple implementation of this in Ruby just now: http://gist.github.com/262667 .
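The linked gist is in Ruby; as a rough sketch of the same recursive idea in Python (the names here are mine):

```python
def build_trie(words):
    # dict-of-dicts trie; "$" marks the end of a word
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return trie

def search(node, pattern, prefix=""):
    # Walk the trie; on '?', recurse into every child (the
    # "try all sub-tries" step from the answer).
    if not pattern:
        return [prefix] if "$" in node else []
    ch, rest = pattern[0], pattern[1:]
    results = []
    for c, child in node.items():
        if c != "$" and (ch == "?" or ch == c):
            results += search(child, rest, prefix + c)
    return sorted(results)

trie = build_trie(["these", "those", "theme", "there", "other"])
print(search(trie, "th??e"))  # ['theme', 'there', 'these', 'those']
```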

+22
Dec 23 '09 at 15:07

A Directed Acyclic Word Graph would be the ideal data structure for this problem. It combines the efficiency of a trie (a trie can be seen as a special case of a DAWG), but is much more space-efficient. A typical DAWG takes a fraction of the size of a plain text file with the words.

Enumerating the words that meet given conditions is simple, and the same as in a trie - you traverse the graph depth-first.

+16
Dec 24 '09 at 11:45

Anna's second solution is the inspiration for this one.

First load all the words into memory and divide the dictionary into sections based on the word length.

For each length, make n copies of an array of pointers to the words. Sort each array so that the strings appear in order when rotated by a certain number of letters. For example, suppose the original list of 5-letter words is [plane, apple, space, train, happy, stack, hacks]. Then your five pointer arrays will be:

 rotated by 0 letters: [apple, hacks, happy, plane, space, stack, train]
 rotated by 1 letter:  [hacks, happy, plane, space, apple, train, stack]
 rotated by 2 letters: [space, stack, train, plane, hacks, apple, happy]
 rotated by 3 letters: [space, stack, train, hacks, apple, plane, happy]
 rotated by 4 letters: [apple, plane, space, stack, train, hacks, happy]

(Instead of pointers, you can use integers that identify words if this saves space on your platform.)

To perform a search, just ask how much you would need to rotate the pattern so that the question marks appear at the end. Then you can binary-search in the corresponding list.

If you need to find matches for ??ppy, you would rotate it by 2 to get ppy??. So look in the array that is in order when rotated by 2 letters. A quick binary search finds that "happy" is the only match.

If you need to find matches for th??g, you would rotate it by 4 to get gth??. So look in array 4, where a binary search finds that there are no matches.

This works regardless of the number of question marks, as long as they all appear together.

Space required in addition to the dictionary itself: for words of length N, this requires space for (N times the number of words of length N) pointers or integers.

Time per search: O(log n), where n is the number of words of the corresponding length.

Python implementation:

 import bisect

 class Matcher:
     def __init__(self, words):
         # Sort the words into bins by length.
         bins = []
         for w in words:
             while len(bins) <= len(w):
                 bins.append([])
             bins[len(w)].append(w)

         # Make n copies of each list, sorted by rotations.
         for n in range(len(bins)):
             bins[n] = [sorted(bins[n], key=lambda w: w[i:]+w[:i])
                        for i in range(n)]
         self.bins = bins

     def find(self, pattern):
         bins = self.bins
         if len(pattern) >= len(bins):
             return []

         # Figure out which array to search.
         r = (pattern.rindex('?') + 1) % len(pattern)
         rpat = (pattern[r:] + pattern[:r]).rstrip('?')
         if '?' in rpat:
             raise ValueError("non-adjacent wildcards in pattern: "
                              + repr(pattern))
         a = bins[len(pattern)][r]

         # Binary-search the array.
         class RotatedArray:
             def __len__(self):
                 return len(a)
             def __getitem__(self, i):
                 word = a[i]
                 return word[r:] + word[:r]
         ra = RotatedArray()
         start = bisect.bisect(ra, rpat)
         stop = bisect.bisect(ra, rpat[:-1] + chr(ord(rpat[-1]) + 1))

         # Return the matches.
         return a[start:stop]

 words = open('/usr/share/dict/words', 'r').read().split()
 print "Building matcher..."
 m = Matcher(words)  # takes 1-2 seconds, for me
 print "Done."
 print m.find("st??k")
 print m.find("ov???low")

On my computer, the system dictionary is 909 KB, and this program uses about 3.2 MB of memory in addition to what it takes just to store the words (pointers are 4 bytes). For this dictionary, you could cut that in half by using 2-byte integers instead of pointers, because there are fewer than 2^16 words of each length.

Measurements: on my machine, m.find("st??k") runs in 0.000032 seconds, m.find("ov???low") in 0.000034 seconds, and m.find("????????????????e") in 0.000023 seconds.

By writing out the binary search instead of using the RotatedArray class and the bisect library, I got the first two numbers down to 0.000016 seconds: twice as fast. Implementing this in C++ would make it faster still.

+9
Dec 30 '09 at 19:08

First we need a way to compare the query string with a given entry. Let's assume a function that uses regular expressions: matches(query, trialstr).

An O(n) algorithm would be to simply step through every list item (your dictionary would be represented as a list in the program), comparing each to your query string.

With a little precalculation you can improve on this for large numbers of queries by building an additional list of words for each letter, so your dictionary might look like this:

 wordsbyletter = {
     'a' : ['aardvark', 'abacus', ... ],
     'b' : ['bat', 'bar', ...],
     ....
 }

However, this would be of limited use, particularly if the query string starts with an unknown character. So we can do better by noting where a particular letter lies within a given word, generating:

 wordsmap = {
     'a' : { 0 : ['aardvark', 'abacus'],
             1 : ['bat', 'bar'],
             2 : ['abacus'] },
     'b' : { 0 : ['bat', 'bar'],
             1 : ['abacus'] },
     ....
 }

As you can see, using these indexes significantly increases the amount of storage required - in particular, a dictionary of n words of average length m needs n·m² of storage. In return, you can now very quickly look up all the words in each set that could possibly match.

A final optimization (which you could use off the bat even with the naive approach) is to separate all words of the same length into distinct stores, since you always know how long the word is.

This version would be O(k·x), where k is the number of known letters in the query word and x = x(n) is the time to look up a single item in your implementation's dictionary of length n (usually log(n)).

So with a final dictionary like:

 allmap = {
     3 : { 'a' : { 1 : ['ant', 'all'],
                   2 : ['bar', 'pat'] },
           'b' : { 1 : ['bar', 'boy'],
           ... },
     4 : { 'a' : { 1 : ['ante'],
     ....

Then our algorithm becomes:

 possiblewords = set()
 firsttime = True
 wordlen = len(query)
 for idx, letter in enumerate(query):
     if letter != '?':
         matchesthisletter = set(allmap[wordlen][letter][idx])
         if firsttime:
             possiblewords = matchesthisletter
             firsttime = False
         else:
             possiblewords &= matchesthisletter

At the end, the set possiblewords will contain all the matching words.

+4
Dec 23 '09 at 14:42

If you generate all the possible words that match the pattern (arate, arbte, arcte ... zryte, zrzte) and then look each one up in a binary tree representation of the dictionary, that will have average performance O(e^N1 * log(N2)), where N1 is the number of question marks and N2 is the size of the dictionary. It seems good enough to me, but I am sure it is possible to come up with a better algorithm.
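A quick Python sketch of this idea, with a sorted list plus bisect standing in for the binary tree (both give a log(N2) probe per generated candidate; all names here are mine):

```python
import bisect
from itertools import product
from string import ascii_lowercase

def query(pattern, sorted_words):
    # Generate every candidate word, then binary-search for each one.
    slots = [i for i, c in enumerate(pattern) if c == "?"]
    matches = []
    for letters in product(ascii_lowercase, repeat=len(slots)):
        cand = list(pattern)
        for i, ch in zip(slots, letters):
            cand[i] = ch
        cand = "".join(cand)
        j = bisect.bisect_left(sorted_words, cand)
        if j < len(sorted_words) and sorted_words[j] == cand:
            matches.append(cand)
    return matches

words = sorted(["write", "wrote", "irate", "crate", "count"])
print(query("?r?te", words))  # ['crate', 'irate', 'write', 'wrote']
```

With two question marks this is 676 probes, which is cheap; the exponential blow-up only bites with many wildcards.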

EDIT: if you have more than, say, three question marks, have a look at Phil H's answer and his letter-indexing approach.

+3
Dec 23 '09 at 14:33

Assuming you have enough memory, you could build a giant hashmap to provide answers in constant time. Here is a quick example in Python:

 from array import array

 all_words = open("english-words").read().split()
 big_map = {}

 def populate_map(word):
     for i in range(pow(2, len(word))):
         bin = _bin(i, len(word))
         candidate = array('c', word)
         for j in range(len(word)):
             if bin[j] == "1":
                 candidate[j] = "?"
         if candidate.tostring() in big_map:
             big_map[candidate.tostring()].add(word)
         else:
             big_map[candidate.tostring()] = set([word])

 def _bin(x, width):
     return ''.join(str((x>>i)&1) for i in xrange(width-1,-1,-1))

 def run():
     for word in all_words:
         populate_map(word)

 run()

 >>> big_map["y??r"]
 set(['your', 'year'])
 >>> big_map["yo?r"]
 set(['your'])
 >>> big_map["?o?r"]
 set(['four', 'poor', 'door', 'your', 'hour'])
+3
Dec 23 '09 at 15:29

You can look at how this is done in aspell. It offers suggestions of correct words for misspelled ones.

+2
Dec 23 '09 at 14:30

Build a hash set of all the words. To find matches, replace the question marks in the pattern with each possible combination of letters. If there are two question marks, a query consists of 26² = 676 quick hash-table lookups, each with constant expected time.

 import itertools

 words = set(open("/usr/share/dict/words").read().split())

 def query(pattern):
     i = pattern.index('?')
     j = pattern.rindex('?') + 1
     for combo in itertools.product('abcdefghijklmnopqrstuvwxyz', repeat=j-i):
         attempt = pattern[:i] + ''.join(combo) + pattern[j:]
         if attempt in words:
             print attempt

This uses less memory than my other answer, but it is exponentially slower as you add extra question marks.

+2
Dec 30 '09 at 21:26

If 80-90% accuracy is acceptable, you could manage with Peter Norvig's spell checker. The implementation is small and elegant.

+2
Jan 03 '10 at 12:48

A regular expression solution will consider every possible value in your dictionary. If performance is your biggest constraint, you could build an index to speed it up considerably.

You could start with an index for each word length containing, for each index=character pair, the set of matching words. For length-5 words, for example, 2=r : {write, wrote, drate, arete, arite}, 3=o : {wrote, float, group}, etc. To get the possible matches for the original query, say ?ro??, you intersect the word sets ({wrote, group} in this case).

This assumes that the only wildcard is a single character and that the word length is known up front. If these assumptions are not acceptable, I can recommend n-gram based text matching, such as the kind discussed in this paper.

+1
Dec 23 '09 at 14:56

The data structure you want is called a trie - see the wikipedia article for a brief summary.

A trie is a tree structure in which the paths through the tree form the set of all the words you want to encode - each node can have up to 26 children, one for each possible letter at the next character position. See the diagram in the wikipedia article to understand what I mean.

+1
Dec 23 '09 at 18:33

Here is how I would do it:

  • Combine dictionary words into one long string, separated by a character other than the word.
  • Put all the words in TreeMap, where the key is the word and the value is the offset of the beginning of the word in a large line.
  • Find the base of the search string; that is, the largest leading substring that does not include '?' .
  • Use TreeMap.higherKey(base) and TreeMap.lowerKey(next(base)) to find the range within the string between which matches will be found. (The next method should calculate the next larger word in the base line with the same number or fewer characters, for example next("aa") is "ab" , next("az") is "b" .)
  • Create a regex from the search string (each '?' becomes '.') and use Matcher.find() to scan the segment of the big string corresponding to that range of offsets.

Steps 1 and 2 are done just once, up front, in O(NlogN) time, where N is the number of words in the dictionary.

This approach degenerates to a brute-force regex scan of the entire dictionary when the '?' appears in the first position, but the further to the right the first '?' is, the less matching needs to be done.

FOLLOW-UP:

In the case where the pattern starts with a '?', the base is empty and the range covers the whole dictionary. One way to handle it is to expand the query into the patterns starting with "a", "b", and so on, and look up the range for each of these bases in turn, collecting the matches.

You could take this further: with 2 leading question marks you expand into 26² bases, with 3 into 26³, and in general the cost grows by a factor of 26 for each leading wildcard (the Nth one multiplying the work again), so past the first couple of positions a straight scan of the dictionary becomes the cheaper option.
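The range-narrowing steps can be sketched in Python, with a sorted list and bisect standing in for the TreeMap's higherKey/lowerKey (all names here are mine; next_key implements the next() helper described above):

```python
import bisect
import re

def next_key(base):
    # Smallest string greater than every string with prefix `base`:
    # drop trailing 'z's, then bump the last character.
    # next_key("aa") == "ab", next_key("az") == "b"
    while base and base[-1] == "z":
        base = base[:-1]
    if not base:
        return "{"   # chr(ord('z') + 1): sorts after any lowercase word
    return base[:-1] + chr(ord(base[-1]) + 1)

def lookup(pattern, sorted_words):
    base = pattern.split("?")[0]           # largest '?'-free prefix
    if base:
        lo = bisect.bisect_left(sorted_words, base)
        hi = bisect.bisect_left(sorted_words, next_key(base))
    else:                                   # pattern starts with '?'
        lo, hi = 0, len(sorted_words)       # degenerate case: scan all
    r = re.compile(pattern.replace("?", ".") + "$")
    return [w for w in sorted_words[lo:hi] if r.match(w)]

words = sorted(["heal", "heel", "hell", "help", "herd", "soap"])
print(lookup("he?l", words))  # ['heal', 'heel', 'hell']
```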

+1
Dec 30 '09 at 7:02

Here is another rotation-based idea, which puts the '?' signs at the end of the word, where they are easy to handle. It works as follows.

Append a terminator (@) to every word and index all rotations of each word. To search for "th??e", append the terminator ("th??e@") and rotate it so the question marks come last (e@th??). Then, among the entries of length 5 plus terminator, search for the prefix "e@th"; every continuation is a match, e.g. "e@thos" (those), "e@thes" (these), ...

Lookup is log(N), where N is the size of the dictionary, plus work proportional to the word length (6 on average).

+1
Dec 30 '09 at 15:15

What about a Ternary Search Tree? It has the lookup speed of a trie, but is more space-efficient.

I have implemented this data structure several times; it is quite a simple task in most languages.
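As a sketch, a ternary search tree with a '?'-wildcard search might look like this in Python (the node layout and names are my own; the search follows the classic Bentley-Sedgewick partial-match scheme):

```python
class TSTNode:
    __slots__ = ("ch", "lo", "eq", "hi", "word")
    def __init__(self, ch):
        self.ch, self.lo, self.eq, self.hi, self.word = ch, None, None, None, None

class TST:
    def __init__(self):
        self.root = None

    def insert(self, word):
        def _ins(node, i):
            ch = word[i]
            if node is None:
                node = TSTNode(ch)
            if ch < node.ch:
                node.lo = _ins(node.lo, i)
            elif ch > node.ch:
                node.hi = _ins(node.hi, i)
            elif i + 1 < len(word):
                node.eq = _ins(node.eq, i + 1)
            else:
                node.word = word          # mark end of word
            return node
        self.root = _ins(self.root, 0)

    def match(self, pattern):
        # '?' matches any single character; results come out in
        # lexicographic order (in-order traversal).
        out = []
        def _match(node, i):
            if node is None:
                return
            ch = pattern[i]
            if ch == "?" or ch < node.ch:
                _match(node.lo, i)
            if ch == "?" or ch == node.ch:
                if i + 1 < len(pattern):
                    _match(node.eq, i + 1)
                elif node.word is not None:
                    out.append(node.word)
            if ch == "?" or ch > node.ch:
                _match(node.hi, i)
        _match(self.root, 0)
        return out

t = TST()
for w in ["these", "those", "theme", "there", "toast"]:
    t.insert(w)
print(t.match("th??e"))  # ['theme', 'there', 'these', 'those']
```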

+1
Dec 30 '09 at 18:29

My approach: store the dictionary twice, once as-is and once with every word reversed, taking 2N space; sort both copies. To answer a query, search whichever copy puts more known letters at the front; a binary search narrows things to a "band" of candidates, and a regex filters them. Details and code below.

: . : , , . , , . , : ; , ( , , ABCD?). , , , ; Python , . ( , , .)

The code:

 import bisect, re

 def forward(string): return string
 def reverse(string): return string[::-1]

 index_forward = sorted(line.rstrip('\n')
                        for line in open('/usr/share/dict/words'))
 index_reverse = sorted(map(reverse, index_forward))

 def lookup(pattern):
     "Return a list of the dictionary words that match pattern."
     if reverse(pattern).find('?') <= pattern.find('?'):
         key, index, fixup = pattern, index_forward, forward
     else:
         key, index, fixup = reverse(pattern), index_reverse, reverse
     assert all(c.isalpha() or c == '?' for c in pattern)
     lo = bisect.bisect_left(index, key.replace('?', 'A'))
     hi = bisect.bisect_right(index, key.replace('?', 'z'))
     r = re.compile(pattern.replace('?', '.') + '$')
     return filter(r.match, (fixup(index[i]) for i in range(lo, hi)))

Examples (note that it also handles patterns like AB?D?, where the wildcards are not adjacent):

 >>> lookup('hello')
 ['hello']
 >>> lookup('??llo')
 ['callo', 'cello', 'hello', 'uhllo', 'Rollo', 'hollo', 'nullo']
 >>> lookup('hel??')
 ['helio', 'helix', 'hello', 'helly', 'heloe', 'helve']
 >>> lookup('he?l')
 ['heal', 'heel', 'hell', 'heml', 'herl']
 >>> lookup('hx?l')
 []

Complexity: the extra space is 2N words. Query time varies: the worst case is bad - the pattern "??" has to examine 44,062 candidates in my 235k-word /usr/share/dict/words; but a more typical query such as "h??lo" only examines 190 words before the regex filters them, so it is fast "in practice". Worst-case behavior could no doubt be improved, but this was simple to write.

A rotations index like the ones in other answers would avoid the bad cases, but takes about 10N space instead of 2N (10 being roughly the average word length in /usr/share/dict/words).

You could cut the space back down by storing integer offsets into a single shared word list instead of separate string copies, as suggested elsewhere (at the cost of a little extra code).

+1
Dec 30 '09 at 21:39

Another option is to load the dictionary into an SQLite database.

You can then express the lookup with the LIKE operator, replacing each '?' with its single-character wildcard '_'.
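For example, with Python's built-in sqlite3 module (the table and column names are invented for this sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [(w,) for w in ["these", "those", "theme", "wrote"]])

def lookup(pattern):
    # '?' -> '_' (LIKE's '_' matches exactly one character)
    like = pattern.replace("?", "_")
    rows = conn.execute("SELECT word FROM words WHERE word LIKE ?", (like,))
    return sorted(r[0] for r in rows)

print(lookup("th??e"))  # ['theme', 'these', 'those']
```

Note that a LIKE pattern beginning with a wildcard typically cannot use the index on word, so SQLite falls back to a full table scan in that case.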

+1
Jan 04 '10 at 10:51

Do you only have '?' as a wildcard, or can '*' (any number of characters) occur too? Assuming just '?': for each character position, build an index from letter to the set of words with that letter at that position. For the words write, wrote, drate, arete and arite you would have:

 Character Index 0:
   'a' -> {"arete", "arite"}
   'd' -> {"drate"}
   'w' -> {"write", "wrote"}
 Character Index 1:
   'r' -> {"write", "wrote", "drate", "arete", "arite"}
 Character Index 2:
   'a' -> {"drate"}
   'e' -> {"arete"}
   'i' -> {"write", "arite"}
   'o' -> {"wrote"}
 ...

For the query a?i??, look up index 0 => 'a' => {"arete", "arite"} and index 2 => 'i' => {"write", "arite"}, then intersect the sets, giving {"arite"}.
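A compact Python sketch of these positional indexes and the intersection step (defaultdict-based; it assumes the words are pre-binned by length, as all the example words here are):

```python
from collections import defaultdict

def build_index(words):
    # index[i][ch] = set of words with character ch at position i
    index = defaultdict(lambda: defaultdict(set))
    for w in words:
        for i, ch in enumerate(w):
            index[i][ch].add(w)
    return index

def query(pattern, index):
    # Intersect the word sets for every known (position, letter) pair.
    sets = [index[i][ch] for i, ch in enumerate(pattern) if ch != "?"]
    return set.intersection(*sets) if sets else set()

idx = build_index(["write", "wrote", "drate", "arete", "arite"])
print(query("a?i??", idx))  # {'arite'}
```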

0
Dec 23 '09 at 15:50

A brute-force thought (perhaps as a stopgap while you build one of the smarter indexes): matching each dictionary word against the pattern is independent work, so you could split the dictionary across threads [or processes] and scan the parts in parallel.

It will not approach the speed of the indexed solutions above, but it needs no preprocessing and is trivial to write.

Another thought I had is that you might be over-engineering something here - perhaps a plain database, or a simple precomputed list, would do the job?

0
Jan 03 '10 at 14:15


