What is the best way to search a vocabulary and show a large number of matches using wildcards?

My problem is that I have a vocabulary of about 200,000 words (file size: 1.8 MB). I want to take input from the user, say **id, and show all possible matches, where * can be any letter A–Z (said, maid, etc.).

I am looking for suggestions on the most efficient way to do this, because I want the user to be able to add more specific letters and get a live update of the word matches.

My idea was to try using RegexKitLite, but I have a feeling that it will be incredibly slow.

Thanks for any input!

Edit: Do you think NSPredicate could be used to achieve this?
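Something like this is what I had in mind (just a sketch; `wordList` stands in for my real 200k-word file):

```swift
import Foundation

// Hypothetical word list; in practice this would be loaded from the file.
let wordList = ["said", "maid", "laid", "void", "banana"]

// NSPredicate's LIKE operator uses `?` for exactly one character
// (and `*` for zero or more), so a "**id"-style pattern becomes "??id".
let predicate = NSPredicate(format: "SELF LIKE[c] %@", "??id")
let matches = wordList.filter { predicate.evaluate(with: $0) }
// matches == ["said", "maid", "laid", "void"]
```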

1 answer

What you can do to optimize your search depends heavily on how you want to limit the use of these wildcards.

Specifically: what are the characteristics of your wildcards?

  • prefix-only wildcards ( m/.+foobar/ )
  • suffix-only wildcards ( m/foobar.+/ )
  • atomic wildcards ( m/./ )
  • dynamic wildcards ( m/.+/ )

Prefix-only wildcards

Use a suffix tree or a DAWG: a pattern like m/.+foobar/ fixes the word's ending, so you index by suffix.


Suffix-only wildcards

Use a prefix tree (trie) or a DAWG: m/foobar.+/ fixes the word's beginning, so a plain prefix lookup works. A sketch of both cases follows.
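Since this is an iOS question, here is a rough Swift sketch of a trie (hand-rolled for illustration, not a library type). It answers the suffix-only case directly as a prefix lookup; for the prefix-only case you would insert each word reversed into a second trie and query it with the reversed literal part of the pattern:

```swift
// Minimal hand-rolled trie (illustrative, not a library API).
final class TrieNode {
    var children: [Character: TrieNode] = [:]
    var isWord = false
}

final class Trie {
    private let root = TrieNode()

    func insert(_ word: String) {
        var node = root
        for ch in word {
            if node.children[ch] == nil { node.children[ch] = TrieNode() }
            node = node.children[ch]!
        }
        node.isWord = true
    }

    /// All stored words beginning with `prefix` (answers "foobar.+"-style queries).
    func words(withPrefix prefix: String) -> [String] {
        var node = root
        for ch in prefix {
            guard let next = node.children[ch] else { return [] }
            node = next
        }
        var results: [String] = []
        collect(node, prefix, into: &results)
        return results
    }

    private func collect(_ node: TrieNode, _ prefix: String, into results: inout [String]) {
        if node.isWord { results.append(prefix) }
        for (ch, child) in node.children {
            collect(child, prefix + String(ch), into: &results)
        }
    }
}

// For ".+bar"-style patterns: build a second trie over String(word.reversed())
// and call words(withPrefix: "rab"), reversing the results back.
```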


Atomic wildcards

One way to drastically reduce the number of comparisons you have to perform is to:

Build a BK-tree from your collection of words.

As long as your pattern has a fixed length (1 per wildcard, in your case), you can simply query your BK-tree for nodes at an exact edit distance n, with n being the number of wildcards. Traditional BK-tree queries only have an upper bound on the accepted distance. In your case you also want to introduce a lower bound, narrowing the range of accepted distances down to exactly [n, n] (versus the traditional [0, n]). You will get an array of words that differ from your query word in exactly n characters.

For the query **id, some possible matches would be:

  • void (2x substitutions)
  • laid (2x substitutions)
  • bid (1x substitution, 1x deletion)
  • id (2x deletions)

While not all of these are correct matches for your query yet, they represent a very small subset of your full collection of words.

Last but not least, you run your regex against this small result set and return any remaining matches (see the sketch below).
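To make that concrete, here is a rough Swift sketch of the whole pipeline: a hand-rolled BK-tree with an exact-distance query, followed by the regex post-filter. All names are illustrative, and the word list is a placeholder for your real file:

```swift
import Foundation

// Plain Levenshtein distance (dynamic programming).
func levenshtein(_ a: String, _ b: String) -> Int {
    let a = Array(a), b = Array(b)
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var prev = Array(0...b.count)
    var curr = [Int](repeating: 0, count: b.count + 1)
    for i in 1...a.count {
        curr[0] = i
        for j in 1...b.count {
            let cost = a[i - 1] == b[j - 1] ? 0 : 1
            curr[j] = min(prev[j] + 1,         // deletion
                          curr[j - 1] + 1,     // insertion
                          prev[j - 1] + cost)  // substitution
        }
        swap(&prev, &curr)
    }
    return prev[b.count]
}

// Hand-rolled BK-tree keyed by Levenshtein distance.
final class BKTree {
    private let word: String
    private var children: [Int: BKTree] = [:]

    init(_ word: String) { self.word = word }

    func insert(_ newWord: String) {
        let d = levenshtein(newWord, word)
        if let child = children[d] {
            child.insert(newWord)
        } else {
            children[d] = BKTree(newWord)
        }
    }

    /// Words whose distance to `query` falls within `range`,
    /// e.g. n...n for the "exactly n" query described above.
    func search(_ query: String, range: ClosedRange<Int>) -> [String] {
        let d = levenshtein(query, word)
        var results: [String] = []
        if range.contains(d) { results.append(word) }
        // Triangle inequality: only subtrees keyed within
        // [d - upperBound, d + upperBound] can contain matches.
        for (key, child) in children
            where key >= d - range.upperBound && key <= d + range.upperBound {
            results += child.search(query, range: range)
        }
        return results
    }
}

// Usage: query "**id" (2 wildcards -> exact distance 2, '*' taken literally),
// then post-filter with the real pattern ('*' -> '.', one character each).
let words = ["said", "maid", "laid", "void", "bid", "id", "banana"]
let tree = BKTree(words[0])
for w in words.dropFirst() { tree.insert(w) }

let candidates = tree.search("**id", range: 2...2)
let matches = candidates.filter {
    $0.range(of: "^..id$", options: .regularExpression) != nil
}
// candidates includes "bid" and "id"; the regex pass removes them.
```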

BK-trees use the Levenshtein distance as a spatial heuristic to drastically reduce the number of comparisons required (by how much depends on the entropy of your collection of words).

For further optimization, you can use several BK-trees:

Divide your collection into subsets: one set of words of length 1, one of length 2, one of length 3, and so on. From each subset you build a BK-tree. For the query **id you then query the BK-tree for length 4 (wildcards are counted as characters).
This applies to wildcards interpreted as m/./ . If your pattern were instead interpreted as m/.?/ , you would query the BK-trees for lengths 3 and 4, as sketched below.
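Reusing the BKTree type from the sketch above, the per-length split might look like this (again just a sketch):

```swift
// One BK-tree per word length; a fixed-length pattern like "**id"
// then only has to query the length-4 tree.
let byLength = Dictionary(grouping: words, by: { $0.count })
var trees: [Int: BKTree] = [:]
for (length, bucket) in byLength {
    let tree = BKTree(bucket[0])
    for w in bucket.dropFirst() { tree.insert(w) }
    trees[length] = tree
}

let pattern = "**id"  // wildcards counted as characters
let candidates = trees[pattern.count]?.search(pattern, range: 2...2) ?? []
```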


As an alternative to BK-trees, you could also use a GADDAG, which is a data structure (a trie specialization) specifically designed for Scrabble-style lookups.

If I'm not mistaken, though, your wildcards would have to be interpreted strictly as m/./ for it to apply.


Dynamic wildcards

Off the top of my head, I can't think of a much better solution than running a regular expression against your entire collection of words.
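For completeness, the brute-force version is at least simple. A sketch (the pattern hand-translated to an anchored regex); it is worth measuring before assuming 200,000 words is too slow:

```swift
import Foundation

let words = ["said", "maid", "grid", "banana"]

// A dynamic wildcard like "*id" (one or more letters) as an anchored regex.
let pattern = "^.+id$"
let matches = words.filter {
    $0.range(of: pattern, options: [.regularExpression, .caseInsensitive]) != nil
}
// matches == ["said", "maid", "grid"]
```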
