Recursively parse custom markup

I have to process an existing custom markup language (which is ugly, but unfortunately cannot be changed, because I am dealing with legacy data and it must stay compatible with a legacy application).

I need to parse the “ranges” commands, and depending on the action taken by the user, either replace these “ranges” in the data with something else (HTML or LaTeX code), or completely remove these “ranges” from the input.

My current solution uses preg_replace_callback() in a loop until there are no more matches, but it is extremely slow for large documents (e.g. ~7 seconds for 394 replacements in a 57K document).

Recursive regular expressions do not seem flexible enough for this task, since I need to access all matches, even in recursion.

Question: How can I improve the performance of my parsing?

Regular expressions can be dropped completely: they are not a requirement, just the only approach I could think of.

Note: The code sample below is greatly reduced (SSCCE). In reality, there are many different “types” of ranges, and the callback does different things depending on the mode of operation (insert values from the database, delete entire ranges, convert to another format, etc.). Keep this in mind!

An example of what I'm doing now:

    <?php

    $data = <<<EOF
    some text 1
    begin-command
    some text 2
    begin-command
    some text 3
    command-end
    some text 4
    begin-command-if "%VAR%" == "value"
    some text 5
    begin-command
    some text 6
    command-end
    command-end
    command-end
    EOF;

    $regex = '~
        # opening tag
        begin-(?P<type>command(?:-if)?)
        # must not contain a nested "command" or "command-if" command!
        (?!.*begin-command(?:-if)?.*command(?:-if)?-end)
        # the parameters for "command-if" are optional
        (?:
            [\s\n]*?
            (?:")[\s\n]*(?P<leftvalue>[^\\\\]*?)[\s\n]*(?:")
            [\s\n]*
            # the operator is optional
            (?P<operator>[=<>!]*)
            [\s\n]*
            (?:")[\s\n]*(?P<rightvalue>[^\\\\]*?)[\s\n]*(?:")
            [\s\n]*?
        )?
        # the real content
        (?P<content>.*?)
        # closing tag
        command(?:-if)?-end
    ~smx';

    $counter = 0;
    $loop_replace = true;

    // repeat until a pass performs no replacement ($loop_replace receives the count)
    while ($loop_replace) {
        // $counter is captured by reference so the ids keep increasing across passes
        $data = preg_replace_callback($regex, function ($matches) use (&$counter) {
            $counter++;
            return "<command id='{$counter}'>{$matches['content']}</command>";
        }, $data, -1, $loop_replace);
    }

    echo $data;
2 answers

I completely removed regular expressions for parsing. I realized that in fact, the original input can be seen as an XML markup tree in some strange representation.

Instead of using regular expressions, I now do the following:

  • Replace anything that could be interpreted as XML with its textual representation (using XML entities)
  • Replace all begin-command ... command-end blocks with appropriate XML tags
    (Note that there are actually several different commands)
  • Let the real parser (XML DOM) process the markup tree
  • Recursively iterate over the DOM
  • For each Node, take the appropriate action, depending on the operating mode.

It seems ugly, but I really did not want to write my own parser, which seemed a bit like "overkill" given the limited time I have to improve the speed. And oh boy, it is blazing fast, much faster than the RegExp solution. That is impressive when you consider the overhead of converting the original input into valid XML and back.

By "blazing fast" I mean that it now takes only ~200 ms for a document that previously took 5-7 seconds to parse with multiple regular expressions.

Here is the code I'm using now:

    // convert raw input to valid XML representation
    // ('&' is escaped first so the entities created here are not escaped again)
    $data = str_replace(
        array('&', '<', '>'),
        array('&amp;', '&lt;', '&gt;'),
        $data
    );
    $data = preg_replace(
        '!begin-(command|othercommand|morecommand)(?:-(?P<options>\S+))?!',
        '<\1 options="\2">',
        $data
    );
    $data = preg_replace(
        '!(command|othercommand|morecommand)-end!',
        '</\1>',
        $data
    );

    // use DOM to parse XML representation
    $dom = new \DOMDocument();
    $dom->loadXML("<?xml version='1.0' ?>\n<document>".$data.'</document>');
    $xpath = new \DOMXPath($dom);

    // iterate over DOM, recursively replace commands with conversion results
    // (select the children of <document>, so the wrapper element itself survives)
    foreach ($xpath->query('/document/*') as $node) {
        if ($node->nodeType == XML_ELEMENT_NODE) {
            convertNode($node, 'form', $dom, $xpath);
        }
    }

    // convert XML DOM back to raw format
    $data = $dom->saveXML();
    $data = substr($data, strpos($data, "<document>")+10, -12);
    // ('&amp;' is decoded last, mirroring the escaping order above)
    $data = str_replace(
        array('&lt;', '&gt;', '&amp;'),
        array('<', '>', '&'),
        $data
    );

    // output the stuff
    echo $data;

    function convertNode (\DOMNode $node, $output_mode, $dom, $xpath) {
        $type = $node->tagName;
        $children = $xpath->query('./*', $node);

        // recurse over child nodes
        foreach ($children as $childNode) {
            if ($childNode->nodeType == XML_ELEMENT_NODE) {
                convertNode($childNode, $output_mode, $dom, $xpath);
            }
        }

        // in production code, here is actual logic
        // to process the several command types
        $newNode = $dom->createTextNode(
            "<$type>" . $node->textContent . "</$type>"
        );

        // replace node with command result
        if ($node->parentNode) {
            $node->parentNode->replaceChild($newNode, $node);
            // just to be sure - normalize parent node
            $newNode->parentNode->normalize();
        }
    }
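
For comparison, the hand-written parser mentioned above does not have to be large. The following is only a minimal sketch under stated assumptions, not the production approach: `convertBlocks()` is a hypothetical helper, it handles only the plain `command` type (no `command-if` parameters, no other command names), and it numbers blocks in the order they are closed. It tokenizes the input once with `preg_split()` and keeps a stack of output buffers, one per open block.

```php
<?php
// Hypothetical helper (not from the answer): converts every
// begin-command ... command-end block in a single left-to-right pass.
function convertBlocks(string $data): string
{
    // Split on the block markers, keeping the markers in the token list.
    // Assumption: only the plain "command" type is handled in this sketch.
    $tokens = preg_split(
        '~(begin-command|command-end)~',
        $data,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    $stack = [''];   // output buffers; index 0 is the top-level document
    $counter = 0;

    foreach ($tokens as $token) {
        if ($token === 'begin-command') {
            $stack[] = '';                        // open a buffer for the new block
        } elseif ($token === 'command-end' && count($stack) > 1) {
            $content = array_pop($stack);         // innermost open block is complete
            $counter++;
            // Stand-in for the real per-command logic.
            $stack[count($stack) - 1] .= "<command id='{$counter}'>{$content}</command>";
        } else {
            $stack[count($stack) - 1] .= $token;  // plain text or a stray command-end
        }
    }

    // Assumed policy: blocks that were never closed are emitted unchanged.
    while (count($stack) > 1) {
        $content = array_pop($stack);
        $stack[count($stack) - 1] .= 'begin-command' . $content;
    }

    return $stack[0];
}

echo convertBlocks("a begin-command b begin-command c command-end d command-end e"), "\n";
```

Because every token is visited exactly once, the running time grows linearly with the input size, whereas the loop in the question rescans the whole document after each replacement.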

Take a look at the fourth line of your regular expression:

 (?!.*begin-command(?:-if)?.*command(?:-if)?-end) 

With the modifiers you use (s in particular, which lets . match newlines), this lookahead has to read to the end of your file every time it is encountered.

Making your .* lazy can slightly improve performance on these large files:

 (?!.*?begin-command(?:-if)?.*?command(?:-if)?-end) 

Also, since (?:-if)? only ever comes after begin-command, you can just get rid of it there and do something like:

 (?!.*?begin-command.*?command(?:-if)?-end) 
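
A small sketch for checking these variants against each other, under stated assumptions: `convertWith()` is a hypothetical helper, the pattern is a reduced form of the one in the question (no `command-if` parameters), and the input is synthetic. It runs the question's replace loop once per lookahead, confirms that all three produce the same output, and prints the number of passes and the elapsed time. Note that with this lookahead a block only matches when no later block follows it, so each pass replaces at most one block.

```php
<?php
// Hypothetical helper: runs the question's replace loop with the given
// lookahead and returns the converted text plus the number of passes.
function convertWith(string $lookahead, string $data): array
{
    // Reduced form of the question's pattern (assumption: no parameters).
    $regex = '~begin-command(?:-if)?' . $lookahead
           . '(?P<content>.*?)command(?:-if)?-end~s';
    $counter = 0;
    $passes = 0;

    do {
        $data = preg_replace_callback($regex, function ($m) use (&$counter) {
            $counter++;
            return "<command id='{$counter}'>{$m['content']}</command>";
        }, $data, -1, $count);
        $passes++;
    } while ($count > 0);   // stop after the first pass without a replacement

    return [$data, $passes];
}

$variants = [
    'greedy'  => '(?!.*begin-command(?:-if)?.*command(?:-if)?-end)',
    'lazy'    => '(?!.*?begin-command(?:-if)?.*?command(?:-if)?-end)',
    'reduced' => '(?!.*?begin-command.*?command(?:-if)?-end)',
];

// Synthetic input: 50 sibling blocks, each containing one nested block.
$sample = str_repeat(
    "text begin-command a begin-command b command-end c command-end\n",
    50
);

$results = [];
foreach ($variants as $name => $lookahead) {
    $start = microtime(true);
    [$output, $passes] = convertWith($lookahead, $sample);
    $results[$name] = $output;
    printf("%-8s %4d passes  %.4f s\n", $name, $passes, microtime(true) - $start);
}

// All three lookaheads accept and reject exactly the same positions.
var_dump($results['greedy'] === $results['lazy']
      && $results['lazy'] === $results['reduced']);
```

The timings depend on the machine and the input, so measure with a real document before drawing conclusions. The pass count is the same for all three variants, which suggests that the number of full-document passes, not only the lookahead itself, drives the total cost.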