I created a router in PHP that accepts DSL (based on the Rails 3 route) and converts it to Regex. It has optional segments (indicated by (nested) brackets). The following is the current lexing algorithm:
private function tokenize($pattern) { $rules = array( self::OPEN_PAREN_TYPE => '/^(\()/', self::CLOSE_PAREN_TYPE => '/^(\))/', self::VARIABLE_TYPE => '/^:([a-z0-9_]+)/', self::TEXT_TYPE => '/^([^:()]+)/', ); $cursor = 0; $tokens = array(); $buffer = $pattern; $buflen = strlen($buffer); while ($cursor < $buflen) { $chunk = substr($buffer, $cursor); $matched = false; foreach ($rules as $type => $rule) { if (preg_match($rule, $chunk, $matches)) { $tokens[] = array( 'type' => $type, 'value' => $matches[1], ); $matched = true; $cursor += strlen($matches[0]); } } if (!$matched) { throw new \Exception(sprintf('Problem parsing route "%s" at char "%d".', $pattern, $cursor)); } } return $tokens; }
Are there any obvious accelerations I'm missing? Is there any way to discard preg_ * altogether or combine regular expressions into one pattern, etc.?
Here is the xhprof frame (~ 2500 unique test routes are used for testing):

I know that the best solution would not be to name it for every request (I plan on caching using APC, etc.), but I would like to make it as efficient as possible for people who use this library without using APC.
EDIT:
I also wrote a version of the fast state apparatus that seems to work better. I am still open to suggestions on both fronts, as I believe the first code was more elegant.
private function tokenize2($pattern) { $buffer = ''; $invariable = false; $tokens = array(); foreach (str_split($pattern) as $char) { switch ($char) { case '(': if ($buffer) { $tokens[] = array( 'type' => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE, 'value' => $buffer, ); $buffer = ''; $invariable = false; } $tokens[] = array( 'type' => self::OPEN_PAREN_TYPE, ); break; case ')': if ($buffer) { $tokens[] = array( 'type' => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE, 'value' => $buffer, ); $buffer = ''; $invariable = false; } $tokens[] = array( 'type' => self::CLOSE_PAREN_TYPE, ); break; case ':': if ($buffer) { $tokens[] = array( 'type' => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE, 'value' => $buffer, ); $buffer = ''; $invariable = false; } $invariable = true; break; default: if ($invariable && !(ctype_alnum($char) || '_' == $char )) { $invariable = false; $tokens[] = array( 'type' => self::VARIABLE_TYPE, 'value' => $buffer, ); $buffer = ''; $invariable = false; } $buffer .= $char; break; } } if ($buffer) { $tokens[] = array( 'type' => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE, 'value' => $buffer, ); $buffer = ''; } return $tokens;

In the end, I just used the state machine for performance reasons and cached the entire lexing process using APC (because ... why not).
If someone can contribute, I will be happy to answer.