Determine the specificity of a regular expression

Given the following regular expressions:

- alice@[a-z]+\.[a-z]+
- [a-z]+@[a-z]+\.[a-z]+
- .*

The string alice@myprovider.com obviously matches all three regular expressions. In the application I am developing, we are only interested in the "most specific" match, which in this case is obviously the first. Unfortunately, there seems to be no way to do this: we use PCRE, I did not find any feature for it, and an Internet search was not fruitful either.
One possible approach would be to store the regular expressions sorted by descending specificity and then simply take the first match. Of course, the next question is how to sort an array of regular expressions by specificity. Making the end user responsible for keeping the array sorted is not an option. So hopefully you guys can help me here ...
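
To make the "sorted list, first match wins" idea concrete, here is a minimal sketch. It assumes the list is already in the right order, which is exactly the unsolved part; the names `PATTERNS` and `first_match` are made up for illustration, and Python's `re` module stands in for PCRE.

```python
import re

# Hypothetical rule list, assumed to be ordered from most to least specific.
PATTERNS = [
    r"alice@[a-z]+\.[a-z]+",
    r"[a-z]+@[a-z]+\.[a-z]+",
    r".*",
]

def first_match(patterns, text):
    """Return the first pattern that matches the whole of text, or None."""
    for pattern in patterns:
        if re.fullmatch(pattern, text):
            return pattern
    return None

print(first_match(PATTERNS, "alice@myprovider.com"))
print(first_match(PATTERNS, "bob@myprovider.com"))
```

With this ordering the first call returns the `alice@...` pattern and the second falls through to the generic address pattern.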

Thanks!

Floor

+4
3 answers

I am working on the same problem. The best answer I have found so far is this paper: http://maple.cs.umbc.edu/~don/projects/ugrad-ht/dminer-ugradthesis.pdf

This is a graduate-level research paper using Perl regular expressions. It has a workable definition of "the most specific regular expression", and it triggers a warning if two regular expressions have the same specificity. It is partly based on the SELinux setfiles utility, but aims to be faster and more accurate. setfiles leaves it to the user to order the patterns from most specific to least specific, and it takes the first match. This can cause problems, which the research paper aims to solve.

In principle, the most specific match is the one that is not a superset of any other match. The difficulty lies in determining which sets are supersets of other sets; of course, the answer depends on the circumstances in which the regular expressions are used. Once you have the list of supersets, it becomes a matter of eliminating matches. So with the regular expressions '^/.*', '^/usr/.*' and '^/home/.*', '^/.*' is a superset of the other two, and the other two are mutually exclusive. In a correct implementation, if the latter two were not mutually exclusive (for example, if the '^' were missing) and neither was a superset of the other, a warning or error should be issued to the user.

To find the match for a given string, first test it against the top-level supersets (in this case '^/.*'). If it does not match a superset, it cannot match any of the more specific patterns below it. If it does match, test it against each child of that superset (these children may themselves be supersets of further sets). If it matches none of the children, the most specific regular expression is the superset itself ('^/.*'). If it matches one of the children, repeat the process with that child's own children, until there are no more specific sets left or none of them matches.
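
The descent described above can be sketched roughly as follows. This is not the paper's code: the superset hierarchy is declared by hand here rather than computed, the names `CHILDREN`, `ROOTS` and `most_specific` are invented for the example, and Python's `re` stands in for Perl regular expressions.

```python
import re

# Hypothetical, hand-declared hierarchy: each pattern maps to the more
# specific patterns (its children) whose matches it is assumed to contain.
CHILDREN = {
    r"^/.*": [r"^/usr/.*", r"^/home/.*"],
    r"^/usr/.*": [],
    r"^/home/.*": [],
}
ROOTS = [r"^/.*"]

def most_specific(text, candidates=ROOTS):
    """Descend from the supersets to the most specific matching pattern."""
    matching = [p for p in candidates if re.match(p, text)]
    if not matching:
        return None  # nothing at this level matches
    if len(matching) > 1:
        # Siblings were assumed to be mutually exclusive; report the clash.
        raise ValueError(f"ambiguous match for {text!r}: {matching}")
    pattern = matching[0]
    # Try the children; fall back to this pattern if none of them match.
    return most_specific(text, CHILDREN[pattern]) or pattern

print(most_specific("/usr/bin"))
print(most_specific("/etc/passwd"))
```

Under these assumptions '/usr/bin' resolves to '^/usr/.*', while '/etc/passwd' matches no child and falls back to the superset '^/.*'. Working out the hierarchy automatically is the hard part and is not attempted here.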

It may be enough to skip the warnings about patterns that are neither mutually exclusive nor supersets of one another, and to raise an error only when an attempt is made to match a string that cannot be resolved. Consider the regular expressions '^/.*', '/usr.*' and '/home.*'. The string '/home/usr' matches all three, and an attempt to match it should raise an error, since it is unclear whether '/usr.*' or '/home.*' is intended as the most specific regular expression.

Depending on the use case, the ideal solution may instead be to return the list of all matching regular expressions that are not supersets of any other matching regular expression. In this case, '/home/usr' should return '/home.*' and '/usr.*', but not '^/.*'.
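
Both behaviours, raising an error on an unresolvable string or returning every most specific candidate, could look something like this. It is only a sketch: the `SUPERSET_OF` table is a hand-written assumption that mirrors the relationships described above, and the function names are invented for illustration.

```python
import re

# Hypothetical table: pattern -> patterns it is assumed to be a superset of.
SUPERSET_OF = {
    r"^/.*": {r"/usr.*", r"/home.*"},
    r"/usr.*": set(),
    r"/home.*": set(),
}

def minimal_matches(text):
    """Return matching patterns that are not supersets of another match."""
    matched = [p for p in SUPERSET_OF if re.search(p, text)]
    return [p for p in matched if not (SUPERSET_OF[p] & set(matched))]

def resolve(text):
    """Return the single most specific pattern, or raise if ambiguous."""
    candidates = minimal_matches(text)
    if len(candidates) > 1:
        raise ValueError(f"ambiguous match for {text!r}: {candidates}")
    return candidates[0] if candidates else None

print(minimal_matches("/home/usr"))
print(resolve("/usr/bin"))
```

Here `minimal_matches` is the lenient variant and `resolve` the strict one; calling `resolve("/home/usr")` raises `ValueError` because two candidates remain.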

The paper describes the approach only in abstract terms rather than as working code. I will try to write some actual code to implement it, or maybe write to the author to see whether I can get the code. If I get something that really works, I will post it here.

+5

My gut instinct says that this is not only a hard problem, both in computational cost and in implementation complexity, but that it may be unsolvable in any realistic way. Consider the following two regular expressions, both of which accept the string alice@myprovider.com:

  alice@[a-z]+\.[a-z]+
  [a-z]+@myprovider.com

Which one is more specific?
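
A quick check illustrates why there is no obvious answer: each pattern accepts a string that the other rejects, so neither is a superset of the other. This is just an illustration using Python's `re` in place of PCRE; the extra sample addresses and the helper `matches` are made up.

```python
import re

A = r"alice@[a-z]+\.[a-z]+"
B = r"[a-z]+@myprovider.com"

def matches(pattern, text):
    """True if the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

for sample in ["alice@myprovider.com", "alice@example.org", "bob@myprovider.com"]:
    print(sample, matches(A, sample), matches(B, sample))
```

Only the first sample is accepted by both patterns; the second is accepted by the first pattern alone and the third by the second pattern alone.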

+4

I have been thinking about a similar problem for the route parser of a PHP project. After reading the other answers and comments here, and also thinking about the cost, I might go in a completely different direction.

However, one solution is simply to sort the list of regular expressions by string length, longest first.

This is not ideal, but stripping the [] groups first gets much closer. The first example in the question lists these:

 - alice@[a-z]+\.[a-z]+
 - [a-z]+@[a-z]+\.[a-z]+
 - .*

which become, after deleting every [] group together with its contents:

 - alice@+\.+
 - +@+\.+
 - .*

The same applies to the second example, from another answer. With the [] groups deleted entirely and the result sorted by length, this:

 alice@[a-z]+\.[a-z]+
 [a-z]+@myprovider.com

is sorted as:

 +@myprovider.com
 alice@+\.+

This is a good enough solution, at least for me, if I decide to use it. The downside would be the overhead of stripping all the [] groups before sorting and then applying the resulting order to the unmodified list of regular expressions, but hey, you can't have everything.
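
A rough sketch of this heuristic, written in Python rather than PHP for brevity, might look like the following. The function names are invented, and the bracket-stripping regex is a simplification that ignores escaped brackets and a literal `]` inside a group.

```python
import re

def specificity_key(pattern):
    """Length of the pattern after removing every [...] group."""
    return len(re.sub(r"\[[^\]]*\]", "", pattern))

def sort_by_specificity(patterns):
    """Sort by stripped length, longest first; originals are returned."""
    return sorted(patterns, key=specificity_key, reverse=True)

rules = [r".*", r"[a-z]+@[a-z]+\.[a-z]+", r"alice@[a-z]+\.[a-z]+"]
print(sort_by_specificity(rules))
```

Using a sort key means the stripped strings are computed once per pattern and only used for ordering, so the result still contains the unmodified regular expressions.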

0