What is the best way to search a vocabulary and show a large number of matches using wildcards?

My problem is that I have a vocabulary of about 200,000 words (file size: 1.8 MB). I want to take input from the user, say **id, and show all possible matches, where * can be any letter A–Z (said, maid, etc.).

I am looking for suggestions on the most efficient way to do this, because I want the user to be able to add more specific letters and get a live update of the word matches.

My idea was to try using RegexKitLite, but I have a feeling that it will be incredibly slow.

Thanks for any input!

Edit: Do you think NSPredicate could be used to achieve this?
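Something like this is what I had in mind (just a sketch; `wordList` stands in for my real 200k-word file):

```swift
import Foundation

// Hypothetical word list; in practice this would be loaded from the file.
let wordList = ["said", "maid", "laid", "void", "banana"]

// NSPredicate's LIKE operator uses `?` for exactly one character
// (and `*` for zero or more), so a "**id"-style pattern becomes "??id".
let predicate = NSPredicate(format: "SELF LIKE[c] %@", "??id")
let matches = wordList.filter { predicate.evaluate(with: $0) }
// matches == ["said", "maid", "laid", "void"]
```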

1 answer

What you can do to optimize your search depends heavily on how you want to limit the use of these wildcards.

Specifically: what are the characteristics of your wildcards?

  • prefix-only wildcards ( m/.+foobar/ )
  • suffix-only wildcards ( m/foobar.+/ )
  • atomic wildcards ( m/./ )
  • dynamic wildcards ( m/.+/ )

Prefix-only wildcards

Use a suffix tree or a DAWG: a pattern like m/.+foobar/ fixes the word's ending, so you index by suffix.


Suffix-only wildcards

Use a prefix tree (trie) or a DAWG: m/foobar.+/ fixes the word's beginning, so a plain prefix lookup works. A sketch of both cases follows.
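Since this is an iOS question, here is a rough Swift sketch of a trie (hand-rolled for illustration, not a library type). It answers the suffix-only case directly as a prefix lookup; for the prefix-only case you would insert each word reversed into a second trie and query it with the reversed literal part of the pattern:

```swift
// Minimal hand-rolled trie (illustrative, not a library API).
final class TrieNode {
    var children: [Character: TrieNode] = [:]
    var isWord = false
}

final class Trie {
    private let root = TrieNode()

    func insert(_ word: String) {
        var node = root
        for ch in word {
            if node.children[ch] == nil { node.children[ch] = TrieNode() }
            node = node.children[ch]!
        }
        node.isWord = true
    }

    /// All stored words beginning with `prefix` (answers "foobar.+"-style queries).
    func words(withPrefix prefix: String) -> [String] {
        var node = root
        for ch in prefix {
            guard let next = node.children[ch] else { return [] }
            node = next
        }
        var results: [String] = []
        collect(node, prefix, into: &results)
        return results
    }

    private func collect(_ node: TrieNode, _ prefix: String, into results: inout [String]) {
        if node.isWord { results.append(prefix) }
        for (ch, child) in node.children {
            collect(child, prefix + String(ch), into: &results)
        }
    }
}

// For ".+bar"-style patterns: build a second trie over String(word.reversed())
// and call words(withPrefix: "rab"), reversing the results back.
```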


Atomic wildcards

One way to drastically reduce the number of comparisons you have to perform is to:

Build a BK-tree from your collection of words.

As long as your pattern has a fixed length (1 per wildcard, in your case), you can simply query your BK-tree for nodes at an exact edit distance n, with n being the number of wildcards. Traditional BK-tree queries only have an upper bound on the accepted distance. In your case you also want to introduce a lower bound, narrowing the range of accepted distances down to exactly [n, n] (versus the traditional [0, n]). You will get an array of words that differ from your query word in exactly n characters.

For the query **id, some possible matches would be:

  • void (2x substitutions)
  • laid (2x substitutions)
  • bid (1x substitution, 1x deletion)
  • id (2x deletions)

While not all of these are correct matches for your query yet, they represent a very small subset of your full collection of words.

Last but not least, you run your regex against this small result set and return any remaining matches (see the sketch below).
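To make that concrete, here is a rough Swift sketch of the whole pipeline: a hand-rolled BK-tree with an exact-distance query, followed by the regex post-filter. All names are illustrative, and the word list is a placeholder for your real file:

```swift
import Foundation

// Plain Levenshtein distance (dynamic programming).
func levenshtein(_ a: String, _ b: String) -> Int {
    let a = Array(a), b = Array(b)
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var prev = Array(0...b.count)
    var curr = [Int](repeating: 0, count: b.count + 1)
    for i in 1...a.count {
        curr[0] = i
        for j in 1...b.count {
            let cost = a[i - 1] == b[j - 1] ? 0 : 1
            curr[j] = min(prev[j] + 1,         // deletion
                          curr[j - 1] + 1,     // insertion
                          prev[j - 1] + cost)  // substitution
        }
        swap(&prev, &curr)
    }
    return prev[b.count]
}

// Hand-rolled BK-tree keyed by Levenshtein distance.
final class BKTree {
    private let word: String
    private var children: [Int: BKTree] = [:]

    init(_ word: String) { self.word = word }

    func insert(_ newWord: String) {
        let d = levenshtein(newWord, word)
        if let child = children[d] {
            child.insert(newWord)
        } else {
            children[d] = BKTree(newWord)
        }
    }

    /// Words whose distance to `query` falls within `range`,
    /// e.g. n...n for the "exactly n" query described above.
    func search(_ query: String, range: ClosedRange<Int>) -> [String] {
        let d = levenshtein(query, word)
        var results: [String] = []
        if range.contains(d) { results.append(word) }
        // Triangle inequality: only subtrees keyed within
        // [d - upperBound, d + upperBound] can contain matches.
        for (key, child) in children
            where key >= d - range.upperBound && key <= d + range.upperBound {
            results += child.search(query, range: range)
        }
        return results
    }
}

// Usage: query "**id" (2 wildcards -> exact distance 2, '*' taken literally),
// then post-filter with the real pattern ('*' -> '.', one character each).
let words = ["said", "maid", "laid", "void", "bid", "id", "banana"]
let tree = BKTree(words[0])
for w in words.dropFirst() { tree.insert(w) }

let candidates = tree.search("**id", range: 2...2)
let matches = candidates.filter {
    $0.range(of: "^..id$", options: .regularExpression) != nil
}
// candidates includes "bid" and "id"; the regex pass removes them.
```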

BK-trees use the Levenshtein distance as a spatial heuristic to drastically reduce the number of comparisons required (by how much depends on the entropy of your collection of words).

For further optimization, you can use several BK-trees:

Divide your collection into subsets: one set of words of length 1, one of length 2, one of length 3, and so on. From each subset you build a BK-tree. For the query **id you then query the BK-tree for length 4 (wildcards are counted as characters).
This applies to wildcards interpreted as m/./ . If your pattern were instead interpreted as m/.?/ , you would query the BK-trees for lengths 3 and 4, as sketched below.
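Reusing the BKTree type from the sketch above, the per-length split might look like this (again just a sketch):

```swift
// One BK-tree per word length; a fixed-length pattern like "**id"
// then only has to query the length-4 tree.
let byLength = Dictionary(grouping: words, by: { $0.count })
var trees: [Int: BKTree] = [:]
for (length, bucket) in byLength {
    let tree = BKTree(bucket[0])
    for w in bucket.dropFirst() { tree.insert(w) }
    trees[length] = tree
}

let pattern = "**id"  // wildcards counted as characters
let candidates = trees[pattern.count]?.search(pattern, range: 2...2) ?? []
```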


As an alternative to BK-trees, you could also use a GADDAG, which is a data structure (a trie specialization) specifically designed for Scrabble-style lookups.

If I'm not mistaken, though, your wildcards would have to be interpreted strictly as m/./ for it to apply.


Dynamic wildcards

Off the top of my head, I can't think of a much better solution than running a regular expression against your entire collection of words.
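For completeness, the brute-force version is at least simple. A sketch (the pattern hand-translated to an anchored regex); it is worth measuring before assuming 200,000 words is too slow:

```swift
import Foundation

let words = ["said", "maid", "grid", "banana"]

// A dynamic wildcard like "*id" (one or more letters) as an anchored regex.
let pattern = "^.+id$"
let matches = words.filter {
    $0.range(of: pattern, options: [.regularExpression, .caseInsensitive]) != nil
}
// matches == ["said", "maid", "grid"]
```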
