The second solution for Anna is the inspiration for this.
First load all the words into memory and divide the dictionary into sections based on the word length.
For each length, make n copies of the array of pointers to words. Sort each array so that the lines appear in order when they are rotated by a certain number of letters. For example, suppose an initial list of 5-letter words is [plane, apple, space, train, happy, stack, hacks]. Then your five pointer arrays will be:
rotated by 0 letters: [apple, hacks, happy, plane, space, stack, train] rotated by 1 letter: [hacks, happy, plane, space, apple, train, stack] rotated by 2 letters: [space, stack, train, plane, hacks, apple, happy] rotated by 3 letters: [space, stack, train, hacks, apple, plane, happy] rotated by 4 letters: [apple, plane, space, stack, train, hacks, happy]
(Instead of pointers, you can use integers that identify words if this saves space on your platform.)
To perform a search, just ask how much you need to rotate the template so that question marks appear at the end. Then you can perform a binary search in the corresponding list.
If you need to find matches for? ppy, you have to rotate this by 2 to make ppy ??. So, look at the array, which is in order when rotated by 2 letters. A quick binary search reveals that โhappyโ is the only match.
If you need to find matches for th? g, you have to turn it 4 to do gth ??. So, look at array 4, where the binary search detects that there are no matches.
This works regardless of the number of question marks if they all appear together.
Space is required in addition to the dictionary itself: for words of length N, this requires space for (N times the number of words of length N) pointers or integers.
Time to search : O (log n), where n is the number of words of the corresponding length.
Python implementation:
import bisect class Matcher: def __init__(self, words): # Sort the words into bins by length. bins = [] for w in words: while len(bins) <= len(w): bins.append([]) bins[len(w)].append(w) # Make n copies of each list, sorted by rotations. for n in range(len(bins)): bins[n] = [sorted(bins[n], key=lambda w: w[i:]+w[:i]) for i in range(n)] self.bins = bins def find(self, pattern): bins = self.bins if len(pattern) >= len(bins): return [] # Figure out which array to search. r = (pattern.rindex('?') + 1) % len(pattern) rpat = (pattern[r:] + pattern[:r]).rstrip('?') if '?' in rpat: raise ValueError("non-adjacent wildcards in pattern: " + repr(pattern)) a = bins[len(pattern)][r] # Binary-search the array. class RotatedArray: def __len__(self): return len(a) def __getitem__(self, i): word = a[i] return word[r:] + word[:r] ra = RotatedArray() start = bisect.bisect(ra, rpat) stop = bisect.bisect(ra, rpat[:-1] + chr(ord(rpat[-1]) + 1)) # Return the matches. return a[start:stop] words = open('/usr/share/dict/words', 'r').read().split() print "Building matcher..." m = Matcher(words) # takes 1-2 seconds, for me print "Done." print m.find("st??k") print m.find("ov???low")
On my computer, the system dictionary is 909 KB in size, and this program uses about 3.2 MB of memory in addition to what is needed only for storing words (pointers - 4 bytes). For this dictionary, you can cut it in half by using 2-byte integers instead of pointers, because there are less than 2 16 words of each length.
Measurements: On my machine, m.find("st??k") works for 0.000032 seconds, m.find("ov???low") for 0.000034 seconds and m.find("????????????????e") in 0.000023 seconds.
By choosing a binary search instead of using the class RotatedArray and the bisect library, I got these first two numbers up to 0.000016 seconds: twice as fast. Implementing this in C ++ will make it even faster.