Profiling Regex Lexer

Question

Profiling Regex Lexer

I created a router in PHP that accepts DSL (based on the Rails 3 route) and converts it to Regex. It has optional segments (indicated by (nested) brackets). The following is the current lexing algorithm:

private function tokenize($pattern) { $rules = array( self::OPEN_PAREN_TYPE => '/^(\()/', self::CLOSE_PAREN_TYPE => '/^(\))/', self::VARIABLE_TYPE => '/^:([a-z0-9_]+)/', self::TEXT_TYPE => '/^([^:()]+)/', ); $cursor = 0; $tokens = array(); $buffer = $pattern; $buflen = strlen($buffer); while ($cursor < $buflen) { $chunk = substr($buffer, $cursor); $matched = false; foreach ($rules as $type => $rule) { if (preg_match($rule, $chunk, $matches)) { $tokens[] = array( 'type' => $type, 'value' => $matches[1], ); $matched = true; $cursor += strlen($matches[0]); } } if (!$matched) { throw new \Exception(sprintf('Problem parsing route "%s" at char "%d".', $pattern, $cursor)); } } return $tokens; }

Are there any obvious accelerations I'm missing? Is there any way to discard preg_ * altogether or combine regular expressions into one pattern, etc.?

Here is the xhprof frame (~ 2500 unique test routes are used for testing):

callgraph

I know that the best solution would not be to name it for every request (I plan on caching using APC, etc.), but I would like to make it as efficient as possible for people who use this library without using APC.

EDIT:

I also wrote a version of the fast state apparatus that seems to work better. I am still open to suggestions on both fronts, as I believe the first code was more elegant.

 private function tokenize2($pattern) { $buffer = ''; $invariable = false; $tokens = array(); foreach (str_split($pattern) as $char) { switch ($char) { case '(': if ($buffer) { $tokens[] = array( 'type' => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE, 'value' => $buffer, ); $buffer = ''; $invariable = false; } $tokens[] = array( 'type' => self::OPEN_PAREN_TYPE, ); break; case ')': if ($buffer) { $tokens[] = array( 'type' => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE, 'value' => $buffer, ); $buffer = ''; $invariable = false; } $tokens[] = array( 'type' => self::CLOSE_PAREN_TYPE, ); break; case ':': if ($buffer) { $tokens[] = array( 'type' => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE, 'value' => $buffer, ); $buffer = ''; $invariable = false; } $invariable = true; break; default: if ($invariable && !(ctype_alnum($char) || '_' == $char )) { $invariable = false; $tokens[] = array( 'type' => self::VARIABLE_TYPE, 'value' => $buffer, ); $buffer = ''; $invariable = false; } $buffer .= $char; break; } } if ($buffer) { $tokens[] = array( 'type' => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE, 'value' => $buffer, ); $buffer = ''; } return $tokens;

callgraph

In the end, I just used the state machine for performance reasons and cached the entire lexing process using APC (because ... why not).

If someone can contribute, I will be happy to answer.

+4

php lexical-analysis xhprof

efritz Aug 18 '11 at 23:15

source share

1 answer

preinheimer · Accepted Answer · 2011-10-25T18:53:38+0000

Interesting code :).

I'm not quite sure what you are saying is "caching the entire lexing process using APC," so I can assume that you are already doing it.

Is it possible to simply cache the input url and result above the actual lexing process? You are not looking at applying permission-based restrictions here, so the cache is global. Routes are usually limited in number, even over a large stretch, with a few very hot spots. Bypass lexing completely and hit the cache on any previously used route.

Profiling Regex Lexer

More articles: