Efficient algorithm for matching strings with a very large set of patterns

I am looking for an efficient algorithm capable of finding all patterns that match a specific string. The set of patterns can be very large (over 100,000) and dynamic (patterns can be added or removed at any time). Patterns are not necessarily full regular expressions; they can be a subset of regular expressions or something similar to shell patterns (e.g. file-*.txt). The preferred solution handles the regular-expression subset described below.

FYI: I'm not interested in brute-force approaches that test the string against every regexp in the list.

By a simple regex, I mean a regex that supports ?, *, +, character classes [a-z], and possibly the alternation operator |.

To clarify my need: I want to find all the patterns matching this URL:

 http://site1.com/12345/topic/news/index.html 

Based on the pattern set below, the answer should be these patterns:

    http://*.site1.com/*/topic/*
    http://*.site1.com/*
    http://*

Pattern Set:

    http://*.site1.com/*/topic/*
    http://*.site1.com/*/article/*
    http://*.site1.com/*
    http://*.site2.com/topic/*
    http://*.site2.com/article/*
    http://*.site2.com/*
    http://*
3 answers

One approach that comes to mind is to organize the patterns into a tree-like structure.

Example: http://* would contain all of the patterns listed above, and http://*.site1.com/* would contain all of the site1.com patterns. This can significantly reduce the number of patterns that need to be checked.

In addition, you can determine which patterns are mutually exclusive to further prune the list of patterns you search.

So, first take all the patterns and build trees from them. Then find all the roots that match, to determine which branches and nodes need to be analyzed.

Improve the algorithm by determining which branches are mutually exclusive, so that as soon as you find a hit on one branch, you know which branches/nodes do not need to be visited.

To get started, you can be lazy: a first pass might simply sort the patterns and use a naive "does this pattern contain that pattern" test to decide containment. EX: if( "http://*.site1.com/*".startsWith("http://*") == true )

You can get more sophisticated about determining whether one pattern really contains another, but this will get you started.
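As a minimal sketch of that lazy first pass (the PatternTree/Node names and the sort-then-startsWith arrangement are my own illustration, not from the answer):

    import java.util.*;

    // Sketch: sort patterns shortest-first so potential containers come
    // before the patterns they contain, then hang each pattern under the
    // first existing node it startsWith (the naive heuristic above).
    class PatternTree {
        static class Node {
            final String pattern;
            final List<Node> children = new ArrayList<>();
            Node(String pattern) { this.pattern = pattern; }
        }

        static Node build(List<String> patterns) {
            patterns.sort(Comparator.comparingInt(String::length));
            Node root = new Node("");                // artificial root
            for (String p : patterns) insert(root, new Node(p));
            return root;
        }

        static void insert(Node parent, Node node) {
            for (Node child : parent.children) {
                if (node.pattern.startsWith(child.pattern)) { // naive containment
                    insert(child, node);
                    return;
                }
            }
            parent.children.add(node);
        }

        static void print(Node n, String indent) {
            for (Node c : n.children) {
                System.out.println(indent + c.pattern);
                print(c, indent + "  ");
            }
        }

        public static void main(String[] args) {
            List<String> patterns = new ArrayList<>(List.of(
                    "http://*.site1.com/*/topic/*", "http://*.site1.com/*/article/*",
                    "http://*.site1.com/*", "http://*.site2.com/topic/*",
                    "http://*.site2.com/article/*", "http://*.site2.com/*", "http://*"));
            print(build(patterns), "");
        }
    }

On the question's pattern set this prints http://* at the root with the two site1.com branches nested under http://*.site1.com/*; the site2.com topic/article patterns land directly under http://* because the naive startsWith test cannot see past the differing wildcard positions, which is exactly the case the fuller containment check would fix.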

To answer the more general question:

"Does this pattern contain that pattern?"

I believe you will need to actually parse the regexes. This article looks like a good place to start figuring out how to do that: Parsing regular expressions with recursive descent


Here is an approach that I have used quite successfully:

Adding patterns:

For any pattern, there is a set of substrings that a string must contain in order to have a chance of matching it. Call these meta words. For instance:

    dog*fish -> [dog, fish]
    [lfd]og  -> [og]
    dog?     -> [dog]

When you add a pattern to the data structure, break it into meta words and store them in an Aho-Corasick string-matching data structure. Maintain an internal map from meta words back to the patterns they came from.
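As a sketch of this extraction step for the glob subset in question ('?' matches any single character, '*' any run, '[...]' a character class; the MetaWords class is my naming, not the answer's):

    import java.util.*;

    // Sketch: the literal runs between wildcards are the meta words; any
    // string matching the pattern must contain all of them. Characters
    // covered by a class like [lfd] are variable, so they are skipped
    // (hence [lfd]og -> [og]).
    class MetaWords {
        static List<String> extract(String pattern) {
            List<String> words = new ArrayList<>();
            StringBuilder cur = new StringBuilder();
            int i = 0;
            while (i < pattern.length()) {
                char c = pattern.charAt(i);
                if (c == '*' || c == '?') {        // wildcard ends a literal run
                    flush(words, cur);
                    i++;
                } else if (c == '[') {             // skip a (well-formed) class
                    flush(words, cur);
                    i = pattern.indexOf(']', i) + 1;
                } else {
                    cur.append(c);
                    i++;
                }
            }
            flush(words, cur);
            return words;
        }

        private static void flush(List<String> words, StringBuilder cur) {
            if (cur.length() > 0) {
                words.add(cur.toString());
                cur.setLength(0);
            }
        }

        public static void main(String[] args) {
            Map<String, List<String>> index = new TreeMap<>(); // meta word -> patterns
            for (String p : List.of("dog*fish", "[lfd]og", "dog?")) {
                for (String w : extract(p)) {
                    index.computeIfAbsent(w, k -> new ArrayList<>()).add(p);
                }
            }
            System.out.println(index); // {dog=[dog*fish, dog?], fish=[dog*fish], og=[[lfd]og]}
        }
    }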

Running queries:

For a given input string, use the Aho-Corasick data structure you created to retrieve all the meta words contained in that string. Then, using the map you maintained, test only the patterns associated with those meta words.
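And a sketch of the query side, reusing MetaWords.extract from the block above. For brevity a plain contains() loop stands in for the Aho-Corasick scan; only the narrow-then-verify flow is the point, and the glob-to-regex verifier is my own addition:

    import java.util.*;
    import java.util.regex.*;

    // Sketch: find which indexed meta words occur in the input, count them
    // per candidate pattern, and run the slow full match only on patterns
    // whose meta words ALL occur. Assumes the MetaWords class from above.
    class MetaWordQuery {
        static Set<String> query(String input, Map<String, List<String>> index) {
            Map<String, Integer> seen = new HashMap<>(); // pattern -> words found
            for (Map.Entry<String, List<String>> e : index.entrySet()) {
                if (input.contains(e.getKey())) {        // Aho-Corasick stand-in
                    for (String p : e.getValue()) seen.merge(p, 1, Integer::sum);
                }
            }
            Set<String> hits = new TreeSet<>();
            for (Map.Entry<String, Integer> e : seen.entrySet()) {
                String p = e.getKey();
                if (e.getValue() == MetaWords.extract(p).size() && globMatch(p, input)) {
                    hits.add(p);
                }
            }
            return hits;
        }

        // Final verification: translate the glob into a java.util.regex pattern.
        static boolean globMatch(String glob, String s) {
            StringBuilder re = new StringBuilder();
            for (int i = 0; i < glob.length(); i++) {
                char c = glob.charAt(i);
                if (c == '*') re.append(".*");
                else if (c == '?') re.append('.');
                else if (c == '[') { int j = glob.indexOf(']', i); re.append(glob, i, j + 1); i = j; }
                else re.append(Pattern.quote(String.valueOf(c)));
            }
            return Pattern.matches(re.toString(), s);
        }

        public static void main(String[] args) {
            Map<String, List<String>> index = new HashMap<>();
            for (String p : List.of("http://*.site1.com/*/topic/*",
                    "http://*.site1.com/*", "http://*")) {
                for (String w : MetaWords.extract(p)) {
                    index.computeIfAbsent(w, k -> new ArrayList<>()).add(p);
                }
            }
            // a www host, so the '*.site1.com' globs match literally
            System.out.println(query("http://www.site1.com/12345/topic/news/index.html", index));
        }
    }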

This works well because, although full pattern matching is rather slow, you can very quickly narrow down the number of patterns that actually need to be matched. I wrote an implementation of this here that can execute about 200,000 queries per second on my laptop against sets of 150,000+ patterns. Try the program's benchmark mode to test it.


If the set of URLs does not change very quickly, you really should use a regular expression engine that compiles its patterns. Java provides one, but it may be unsatisfying if you want to know which pattern matched.
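For illustration, here is a minimal version of that idea with Java's built-in engine (my own sketch): each glob is translated to a regex and OR-ed into one compiled Pattern, with named groups used to recover which alternative matched. It also shows the limitation hinted at above: a single match reports only the first matching alternative, not all of them.

    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: OR all globs into one compiled regex, one named group per
    // glob, so the engine can report which alternative matched.
    class CombinedPattern {
        public static void main(String[] args) {
            List<String> globs = List.of("http://*.site1.com/*/topic/*",
                    "http://*.site1.com/*", "http://*");
            StringBuilder combined = new StringBuilder();
            for (int i = 0; i < globs.size(); i++) {
                if (i > 0) combined.append('|');
                combined.append("(?<p").append(i).append('>')
                        // escape '.' and expand '*': enough for these URL globs
                        .append(globs.get(i).replace(".", "\\.").replace("*", ".*"))
                        .append(')');
            }
            Pattern p = Pattern.compile(combined.toString());
            Matcher m = p.matcher("http://www.site1.com/12345/topic/news/index.html");
            if (m.matches()) {
                for (int i = 0; i < globs.size(); i++) {
                    if (m.group("p" + i) != null) {
                        // alternation commits to the first branch that matches,
                        // so only one of the three globs is reported here
                        System.out.println("matched: " + globs.get(i));
                    }
                }
            }
        }
    }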

The widely used mechanism for doing this, while also determining which pattern matched, is a lexer generator such as FLEX and similar tools. These take a set of regular expressions, one per "token", and build an integrated FSA that recognizes any of them and is extremely efficient to execute.

You could re-run FLEX whenever your pattern set changes. If that is too slow, get the open-source FLEX sources and integrate them into your engine; it builds the FSA internally, so you could use that directly. (Some engineering would probably be needed.) But if you really have a high-performance problem, doing some work to solve it well shouldn't bother you.

If the set of URLs changes faster than FLEX can generate FSAs (which would be odd), you have a real problem. In that case, you could build an online discrimination tree by scanning each "regular expression" left to right and merging the characters/predicates you see into the existing discrimination tree. Matching is then a walk down the discrimination tree, performing the various tests; if you arrive at a leaf, you have a match, otherwise not. Done right, this can be as fast as the FLEX-generated automaton, but probably a lot, lot slower.
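A rough sketch of such a discrimination tree for the glob subset (GlobTrie is my naming): literal characters become tree edges, '*' becomes a special edge that may consume any number of input characters, and matching walks the input and the tree together, collecting every pattern stored at a reachable leaf:

    import java.util.*;

    // Sketch: an incrementally updatable discrimination tree over glob
    // patterns. Literal characters become edges; '*' becomes a wildcard
    // edge that may consume zero or more input characters.
    class GlobTrie {
        private final Map<Character, GlobTrie> children = new HashMap<>();
        private GlobTrie star;                        // the '*' edge, if any
        private final List<String> patterns = new ArrayList<>(); // ending here

        void add(String pattern) { add(pattern, 0); }

        private void add(String pattern, int i) {
            if (i == pattern.length()) { patterns.add(pattern); return; }
            char c = pattern.charAt(i);
            if (c == '*') {
                if (star == null) star = new GlobTrie();
                star.add(pattern, i + 1);
            } else {
                children.computeIfAbsent(c, k -> new GlobTrie()).add(pattern, i + 1);
            }
        }

        void match(String s, int i, Set<String> hits) {
            if (i == s.length()) hits.addAll(patterns);
            if (star != null) {                       // let '*' eat 0..n chars
                for (int j = i; j <= s.length(); j++) star.match(s, j, hits);
            }
            if (i < s.length()) {
                GlobTrie next = children.get(s.charAt(i));
                if (next != null) next.match(s, i + 1, hits);
            }
        }

        public static void main(String[] args) {
            GlobTrie root = new GlobTrie();
            for (String p : new String[] { "http://*.site1.com/*/topic/*",
                    "http://*.site1.com/*", "http://*" }) root.add(p);
            Set<String> hits = new TreeSet<>();
            // a www host, so the '*.site1.com' globs match literally
            root.match("http://www.site1.com/12345/topic/news/index.html", 0, hits);
            System.out.println(hits); // all three patterns match
        }
    }

Adding or removing a pattern touches only its own path through the tree, which suits a rapidly changing pattern set; the backtracking at '*' edges is what makes this slower than a FLEX-built automaton.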

