What you can do to optimize your search depends heavily on how you want to limit the use of these wildcards.
Exactly: what are the characteristics of your wildcards?
- prefix only wildcards (m /.+ foobar /)
- suffix-only wildcards (m / foobar. + /)
- atomic symbols (
m/./ ) - dynamic wildcards (
m/.+/ )
Wildcards
Use a prefix tree or DAWG
Wildcard suffix only
Use the suffix tree or DAWG
Atomic wildcards
One way to drastically reduce the number of matches you must complete is to:
Create a BKTree from a collection of words .
As until your template has a fixed length (1 in your case), you can simply request your BKTree for nodes with an exact edit distance n , with n being the number of wildcards. Traditional BKTree queries have an upper rejection limit . In your case, you want to introduce an additional lower bound , narrowing the range of the accepted variance to an accuracy of [n,1] (compared to traditionally [0,n] ). You will get an array of words different from your query word using extra n characters.
For **id request some possible matches are possible:
void (2x add-ons)laid (2x add-ons)bad (1x replacement, 1x addition)to (2x replacements)
While these are not yet correct matches for your query , imagine a very small subset of your general collection of words.
So, but not least, you re-run Regex against these results and return any remaining matches .
BKTrees enter levenshtein distance as some spatial heuristic to drastically ( depending on the entropy in your collection of words) reduce the number of comparisons / comparisons required .
To get additional optimization , you can use several BKTrees :
Divide your collection into subsets. One set of words, length 1 , one for length 2 , one for 3 , etc. From each subset, you create a BKTree. For **id request, you then request BKTree for length 4 (wildcards are considered characters).
This applies to wildcards that are interpreted as m/./ . If your template will be interpreted as m/.?/ , you will request BKTrees for lengths 3 and 4.
Alternatively, before BKTrees, you can also use GADDAG , which is a data structure (Trie specialization) specifically designed for Scrabble-style searches.
If I'm not mistaken, your wildcards will have to be interpreted strictly as m/./ .
Dynamic wildcards
I canβt think of any much better solution now than using a regular expression against your collection of words.