I completely dropped regular expressions for the parsing. I realized that the raw input can in fact be seen as an XML markup tree in a somewhat odd representation.
Instead of using regular expressions, I now do the following:
- Replace everything that could be interpreted as XML with its textual representation (using XML entities)
- Replace all begin-command ... command-end blocks with the appropriate XML tags (note that there are actually several different commands)
- Let a real parser (XML DOM) process the markup tree
- Iterate over the DOM recursively
- For each node, take the appropriate action, depending on the operating mode
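To make the first two steps concrete, here is a minimal sketch of how a small input could be turned into XML markup. The helper name `toXmlMarkup`, the two command names, and the sample input are placeholders chosen for illustration, not the real command set.

```php
<?php
// Hypothetical helper covering the "escape" and "tag" steps from the list above.
function toXmlMarkup(string $data): string
{
    // Escape XML-special characters. '&' goes first so that the entities
    // created for '<' and '>' are not escaped a second time.
    $data = str_replace(
        array('&', '<', '>'),
        array('&amp;', '&lt;', '&gt;'),
        $data
    );

    // begin-command[-options] becomes an opening tag with an options attribute
    $data = preg_replace(
        '!begin-(command|othercommand)(?:-(\S+))?!',
        '<\1 options="\2">',
        $data
    );

    // command-end becomes the matching closing tag
    $data = preg_replace('!(command|othercommand)-end!', '</\1>', $data);

    // wrap everything in a single root element so the result is one XML tree
    return '<document>' . $data . '</document>';
}

$input = 'a < b begin-command-bold text begin-othercommand inner othercommand-end command-end';
$xml   = toXmlMarkup($input);
echo $xml, "\n";
```

The nesting of the begin/end pairs carries over directly into nested elements, which is what lets the DOM parser do the structural work afterwards.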
It seems ugly, but I really did not want to write my own parser, which seemed like overkill given the limited time I had to improve the speed. And oh boy, it is blazing fast, much faster than the RegExp solution. That is impressive when you consider the overhead of converting the raw input into valid XML and back.
By "blazing fast" I mean that it now takes only ~200 ms for a document that previously took 5-7 seconds to parse with multiple regular expressions.
Here is the code I'm using now:
    // convert raw input to a valid XML representation;
    // '&' is escaped first so the entities created for '<' and '>' stay intact
    $data = str_replace(
        array('&', '<', '>'),
        array('&amp;', '&lt;', '&gt;'),
        $data
    );
    $data = preg_replace(
        '!begin-(command|othercommand|morecommand)(?:-(?P<options>\S+))?!',
        '<\1 options="\2">',
        $data
    );
    $data = preg_replace(
        '!(command|othercommand|morecommand)-end!',
        '</\1>',
        $data
    );

    // use DOM to parse the XML representation
    $dom = new \DOMDocument();
    $dom->loadXML("<?xml version='1.0' ?>\n<document>".$data.'</document>');
    $xpath = new \DOMXPath($dom);

    // iterate over the DOM, recursively replacing commands with conversion results
    // (without a context node, the query is relative to the root element)
    foreach ($xpath->query('./*') as $node) {
        if ($node->nodeType == XML_ELEMENT_NODE) {
            convertNode($node, 'form', $dom, $xpath);
        }
    }

    // convert the XML DOM back to the raw format:
    // cut off the XML declaration, "<document>" and "</document>\n"
    $data = $dom->saveXML();
    $data = substr($data, strpos($data, "<document>") + 10, -12);
    // '&amp;' is unescaped last so a literal "&lt;" in the input is not turned into "<"
    $data = str_replace(
        array('&lt;', '&gt;', '&amp;'),
        array('<', '>', '&'),
        $data
    );

    // output the stuff
    echo $data;

    function convertNode(\DOMNode $node, $output_mode, $dom, $xpath)
    {
        $type = $node->tagName;
        $children = $xpath->query('./*', $node);

        // recurse over child nodes
        foreach ($children as $childNode) {
            if ($childNode->nodeType == XML_ELEMENT_NODE) {
                convertNode($childNode, $output_mode, $dom, $xpath);
            }
        }

        // in production code, the actual logic to process
        // the several command types goes here
        $newNode = $dom->createTextNode(
            "<$type>" . $node->textContent . "</$type>"
        );

        // replace node with command result
        if ($node->parentNode) {
            $node->parentNode->replaceChild($newNode, $node);
            // just to be sure - normalize parent node
            $newNode->parentNode->normalize();
        }
    }
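The placeholder in `convertNode` is where the per-command logic would go. As a rough sketch of what such a dispatch could look like, assuming a hypothetical `renderCommand` function and made-up output formats (neither is the real production logic), the command type and the operating mode can simply select a branch:

```php
<?php
// Hypothetical dispatcher: command names and output formats are illustrative only.
function renderCommand(string $type, string $options, string $content, string $outputMode): string
{
    switch ($type) {
        case 'command':
            // the same command may render differently per operating mode
            return $outputMode === 'form'
                ? '[input:' . $options . ']' . $content
                : $content;
        case 'othercommand':
            return strtoupper($content);
        default:
            // unknown commands keep their content unchanged
            return $content;
    }
}

function convertElement(DOMElement $node, string $outputMode, DOMDocument $dom): void
{
    // Convert children first, so textContent already holds their results.
    // iterator_to_array() takes a snapshot, because childNodes is a live list.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child instanceof DOMElement) {
            convertElement($child, $outputMode, $dom);
        }
    }

    $result = renderCommand(
        $node->tagName,
        $node->getAttribute('options'),
        $node->textContent,
        $outputMode
    );
    $node->parentNode->replaceChild($dom->createTextNode($result), $node);
}

$dom = new DOMDocument();
$dom->loadXML(
    '<document>x <command options="name">y '
    . '<othercommand options="">z</othercommand></command></document>'
);

foreach (iterator_to_array($dom->documentElement->childNodes) as $child) {
    if ($child instanceof DOMElement) {
        convertElement($child, 'form', $dom);
    }
}

$out = $dom->documentElement->textContent;
echo $out, "\n";
```

Because children are converted before their parent, each command only ever sees plain text as its content, which keeps the individual handlers simple.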