Efficient string matching algorithm

I am trying to build an efficient string matching algorithm. This will run in a high-volume environment, so performance is critical.

Here are my requirements:

  • Given a domain name, e.g. www.example.com, determine whether it matches one of the entries in a list.
  • Entries may be absolute, e.g. www.example.com.
  • Entries may include wildcards, e.g. *.example.com.
  • Wildcard entries match the most specific level and anything deeper. For example, *.example.com matches www.example.com, example.com, and sub.www.example.com.
  • Wildcards are not embedded, i.e. sub.*.example.com will not occur.

Language / Environment: C# (.NET Framework 3.5)

I have looked at splitting the entries (and the domain being searched) into arrays, reversing them, and then iterating through the arrays. While accurate, it feels slow.

I have looked at Regex, but I'm concerned about accurately representing the list of entries as regular expressions.

My question is: what is an efficient way to determine whether a string in the form of a domain name matches any entry in a list of strings, given the description above?

+4
c# algorithm
14 answers

If you want to roll your own, I would store the entries in a tree structure. See my answer to another SO question about spell checkers to see what I mean.

Instead of splitting the structure on "." characters, I would just treat each entry as a complete string. Any equivalent implementation will still have to match characters across the whole string anyway, so you may as well do it all in one pass.

The only differences between this and a standard spell-check trie are:

  • Matching must be performed in reverse (from the end of the string).
  • You must account for the wildcards.

To address point #2, you simply check for a "*" at the end of the traversal.

Quick example:

Entries:

    *.fark.com
    www.cnn.com

Tree:

    m -> o -> c -> . -> k -> r -> a -> f -> . -> *
                    \
                     -> n -> n -> c -> . -> w -> w -> w

Checking www.blog.fark.com involves tracing through the tree until you hit a "*". Since the traversal ends on "*", there is a match.

Checking www.cern.com fails at the second "n" of the n, n, c, ... branch.

Checking dev.www.cnn.com also fails, because the traversal ends on a character other than "*".
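To make this concrete, here is a minimal C# sketch of such a reversed-character trie (C# being the question's environment). It assumes entries are either exact names or of the form "*.suffix" as described in the question; the class and member names are just illustrative:

    using System.Collections.Generic;

    class DomainTrie
    {
        private sealed class Node
        {
            public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
            public bool IsTerminal;                     // a complete entry ends here
        }

        private readonly Node _root = new Node();

        // Entries are stored reversed, e.g. "*.fark.com" as "moc.kraf.*".
        public void Add(string entry)
        {
            Node node = _root;
            for (int i = entry.Length - 1; i >= 0; i--)
            {
                Node next;
                if (!node.Children.TryGetValue(entry[i], out next))
                    node.Children[entry[i]] = next = new Node();
                node = next;
            }
            node.IsTerminal = true;
        }

        public bool Matches(string domain)
        {
            Node node = _root;
            for (int i = domain.Length - 1; i >= 0; i--)
            {
                Node next;
                if (!node.Children.TryGetValue(domain[i], out next))
                    return false;
                node = next;
                // Just crossed a label boundary: if a wildcard entry ends
                // here, it covers this level and anything deeper.
                if (domain[i] == '.' && HasTerminalChild(node, '*'))
                    return true;
            }
            if (node.IsTerminal)                        // exact entry matched
                return true;
            // "*.fark.com" must also match bare "fark.com": peek past "." to "*".
            Node dot;
            return node.Children.TryGetValue('.', out dot) && HasTerminalChild(dot, '*');
        }

        private static bool HasTerminalChild(Node node, char c)
        {
            Node child;
            return node.Children.TryGetValue(c, out child) && child.IsTerminal;
        }
    }

With the two entries above, Matches("www.blog.fark.com") and Matches("fark.com") return true, while Matches("www.cern.com") and Matches("dev.www.cnn.com") return false.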

+12

I would use Regex; just make sure the expression is compiled once (rather than being evaluated over and over).
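As a rough sketch of what that could look like, the whole entry list can be folded into one compiled alternation. This assumes entries are either exact names or "*.suffix"; BuildMatcher is an illustrative name, not an existing API:

    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    static class EntryRegex
    {
        public static Regex BuildMatcher(IEnumerable<string> entries)
        {
            // "*.example.com" -> optional "anything." prefix so the bare
            // domain matches too; exact entries are escaped verbatim.
            string[] alternatives = entries
                .Select(e => e.StartsWith("*.")
                    ? @"(?:.+\.)?" + Regex.Escape(e.Substring(2))
                    : Regex.Escape(e))
                .ToArray();

            // Compile once up front and reuse the instance for every lookup.
            return new Regex("^(?:" + string.Join("|", alternatives) + ")$",
                             RegexOptions.Compiled | RegexOptions.IgnoreCase);
        }
    }

Building the Regex is then a one-time cost, and each lookup is just matcher.IsMatch(domain).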

+5

You don't need a regexp. Just reverse all the strings, strip off the "*", and set a flag to indicate a partial match up to that point, as in the sketch below.
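A minimal sketch of that idea in C#, with illustrative names; each rule is stored reversed, with any leading "*." stripped and remembered in a flag:

    using System;
    using System.Collections.Generic;

    struct ReversedRule
    {
        public string Text;      // rule reversed, leading "*." removed first
        public bool Wildcard;    // true if the original rule began with "*."
    }

    static class ReversedMatch
    {
        public static bool Matches(IEnumerable<ReversedRule> rules, string domain)
        {
            char[] buffer = domain.ToCharArray();
            Array.Reverse(buffer);
            string reversed = new string(buffer);   // "www.fark.com" -> "moc.kraf.www"

            foreach (ReversedRule rule in rules)
            {
                if (rule.Wildcard)
                {
                    // "*.fark.com" is stored as "moc.kraf": match the bare
                    // domain, or any deeper label via the "moc.kraf." prefix.
                    if (reversed == rule.Text || reversed.StartsWith(rule.Text + "."))
                        return true;
                }
                else if (reversed == rule.Text)      // absolute rule: exact match
                {
                    return true;
                }
            }
            return false;
        }
    }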

In any case, a trie or suffix trie looks the most suitable.

If the list of domains is known at compile time, you can look at tokenizing on "." and using a perfect hash generated by gperf.

Links: Google for "trie"; for suffix trees, see http://marknelson.us/1996/08/01/suffix-trees/

+5

I would use a tree structure to store the rules, where each tree node contains a dictionary of its children.

Build the tree so that "com", "net", etc. are the top-level elements, "example" is at the next level, and so on. You will need a special flag to indicate that a node is a wildcard.

To perform a lookup, split the string on periods and iterate backward, walking the tree based on the input.

This is similar to what you said you had considered, but provided the rules do not change every run, a cached tree of dictionaries will be faster than a list of arrays.

In addition, I would argue that this approach will be faster than RegEx.
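Here is a minimal sketch of that dictionary-per-node tree in C#, assuming rules are either exact or "*.suffix"; the names are illustrative:

    using System;
    using System.Collections.Generic;

    class RuleNode
    {
        public readonly Dictionary<string, RuleNode> Children =
            new Dictionary<string, RuleNode>(StringComparer.OrdinalIgnoreCase);
        public bool IsWildcard;   // a "*." rule ends at this label
        public bool IsExact;      // an absolute rule ends at this label
    }

    static class RuleTree
    {
        public static void Add(RuleNode root, string rule)
        {
            bool wildcard = rule.StartsWith("*.");
            string[] labels = (wildcard ? rule.Substring(2) : rule).Split('.');
            RuleNode node = root;
            for (int i = labels.Length - 1; i >= 0; i--)   // "com" first, then "example", ...
            {
                RuleNode next;
                if (!node.Children.TryGetValue(labels[i], out next))
                    node.Children[labels[i]] = next = new RuleNode();
                node = next;
            }
            if (wildcard) node.IsWildcard = true; else node.IsExact = true;
        }

        public static bool Matches(RuleNode root, string domain)
        {
            string[] labels = domain.Split('.');
            RuleNode node = root;
            for (int i = labels.Length - 1; i >= 0; i--)
            {
                RuleNode next;
                if (!node.Children.TryGetValue(labels[i], out next))
                    return false;
                node = next;
                if (node.IsWildcard)    // wildcard covers this level and deeper
                    return true;
            }
            return node.IsExact;
        }
    }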

+3

You seem to have a well-defined set of rules about what you consider valid, so you could use a hand-written LL parser. Such parsers are relatively easy to write and optimize. The parser would produce a tree structure describing the input; I would use that tree as the input to a matching routine that matches the tree against the list of entries using the rules described above.

Here's an article about recursive descent parsers.

+2

Assuming the rules are as you stated: literal, or beginning with "*.".

Java:

    public static boolean matches(String candidate, List<String> rules) {
        for (String rule : rules) {
            if (rule.startsWith("*.")) {
                // Wildcard: match the bare suffix or any subdomain of it,
                // but not, e.g., "badexample.com" against "*.example.com".
                String suffix = rule.substring(2);
                if (candidate.equals(suffix) || candidate.endsWith("." + suffix)) {
                    return true;
                }
            } else if (candidate.equals(rule)) {
                // Literal rule: exact match only.
                return true;
            }
        }
        return false;
    }

It scales linearly with the number of rules you have.

EDIT:

Just to be clear here:

When I say "sort the rules", I really mean build a tree out of the characters of the rules.

Then you use the candidate string to try to traverse the tree (i.e., if I have the string "xyz", I start with the character "x" and see whether it has a "y" branch, and then a "z" branch).

For the wildcards, I use the same concept, but populate the tree "backwards" and traverse from the back of the candidate.

If you have a LOT (a LOT LOT) of rules, I would sort the rules.

For non-wildcard matches, you iterate character by character, narrowing the set of possible rules (i.e., if it starts with "w", you only work against the rules starting with "w", and so on).

If it is a wildcard, you do the same thing, but work against the list of "reversed rules", and simply match the end of the candidate against the end of each rule.

+1

I would try a combination of a trie with longest-prefix matching (which is used in routing for IP networks). Directed acyclic word graphs may be more appropriate than tries if space is a concern.

+1

I am going to offer an alternative to the tree structure: create a compressed index of your domain list using the Burrows-Wheeler transform. See http://www.ddj.com/architect/184405504?pgno=1 for a full explanation of the technique.

+1

Take a look at RegExLib

0

Not sure what your ideas for splitting and iterating were, but it doesn't seem like that has to be slow:

Split the domains as you said, and work backward. Storage can essentially be a tree. Use a hash table to store the TLDs: the key would be, for example, "com", and the value would be a hash table of the subdomains under that TLD, and so on, ad nauseam.

0

Given your requirements, I think you're on the right track in thinking of working from the end of the string (TLD) toward the hostname. You could use regular expressions, but since you aren't really using any of the features of regular expressions, I don't see why you would want to bear their cost. If you reverse the strings, it becomes more obvious that you're really just looking for a prefix match ("*.example.com" becomes: does my input string start with "moc.elpmaxe"?), which certainly doesn't require anything as heavyweight as regular expressions.

Which structure you use to store the list of entries depends on how big the list is and how often it changes... for a huge, stable list, a tree/trie may be the most efficient; a frequently changing list needs a structure that is easy to initialize and update, and so on. Without additional information, I wouldn't want to suggest any one structure.

0

I suppose I'm tempted to answer your question with another: what are you doing where you believe your bottleneck is string matching, above and beyond a simple string compare? Surely something else ranked higher in your performance profiling?

First, I would use the obvious string comparison tests, which will be right in 90% of cases, and if they fail, fall back to a regex.
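As a sketch of that two-stage approach (assuming the exact entries are kept in a hash set for the fast path, and the wildcard entries in one precompiled Regex; the names are illustrative):

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    static class TwoStageMatch
    {
        // Fast path: O(1) hash lookup for the exact entries.
        // Slow path: one precompiled regex covering the wildcard entries.
        public static bool Matches(string domain,
                                   HashSet<string> exactEntries,
                                   Regex wildcardMatcher)
        {
            return exactEntries.Contains(domain) || wildcardMatcher.IsMatch(domain);
        }
    }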

0

If it were just matching strings, you should look at trie data structures and algorithms. An earlier answer suggests that if all of your wildcards are a single wildcard at the beginning, there are specific algorithms you can use. However, the requirement to handle general wildcards means that, for fast execution, you're going to need to build a state machine.

That's what a regex library does for you: "precompiling" the regex == generating the state machine; this is what makes the actual match fast at runtime. You are unlikely to get significantly better performance than that without considerable optimization effort.

If you want to roll your own, I can say that writing your own state-machine generator specifically for multiple wildcards should be educational. In that case, you'll want to read up on the algorithms used in regex libraries...

0

Look into the KMP (Knuth-Morris-Pratt) or BM (Boyer-Moore) algorithms. These let you search a string faster than linear time, at the cost of a little preprocessing. Stripping off the asterisk, of course, is crucial, as others have noted.

One source of information for them:

KMP: http://www-igm.univ-mlv.fr/~lecroq/string/node8.html

BM: http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

0
