PHP: parsing with XML namespaces only

I am trying to parse data like this:

    <vin:layout name="Page" xmlns:vin="http://www.example.com/vin">
        <header>
            {someText}
            <div>
                <!-- some invalid xml code -->
                <aas>
                <nav class="main">
                    <vin:show section="Menu" />
                </nav>
            </div>
        </header>
    </vin:layout>

How can I parse data like this in PHP?

I tried DOM, but it does not work because of the invalid XML inside the root element. Can I tell the parser that everything without the vin namespace is plain text?

1 answer

I would probably throw some kind of tag-soup parser at it: something that can read your format, which apart from these flaws looks fairly well-formed. Nothing in the text stands in the way of a simple scanner based on regular expressions. I called mine Tagsoup; it has only four node types: Starttag, Endtag, Text and Comment. For tags, you need to know their tag name and namespace prefix. The names merely resemble XML/HTML for convenience; in reality this is all "roll your own", so do not hold these terms to any standard.
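To make the idea of a regex-based scanner more concrete, here is a minimal sketch. It is not the Tagsoup code from the gist: the function name `scanTagSoup` and the array-based token layout are assumptions made for illustration, and it assumes attribute values never contain `<` or `>`.

```php
<?php
// Hypothetical sketch of a regex-based tag scanner (not the gist's code).
// Each token is an array with the keys 'type', 'prefix', 'name' and 'raw'.
function scanTagSoup(string $input): array
{
    // Delimiters: comments, or anything shaped like <name ...>, </name> or <ns:name ... />.
    $delimiter = '~(<!--.*?-->|</?[A-Za-z][^\s<>/]*(?:\s[^<>]*)?/?>)~s';
    // With one capture group and no other flags, even indexes hold text
    // and odd indexes hold the captured delimiters.
    $parts = preg_split($delimiter, $input, -1, PREG_SPLIT_DELIM_CAPTURE);

    $tokens = [];
    foreach ($parts as $i => $raw) {
        if ($i % 2 === 0) {
            if ($raw !== '') {
                $tokens[] = ['type' => 'text', 'prefix' => '', 'name' => '', 'raw' => $raw];
            }
            continue;
        }
        if (strncmp($raw, '<!--', 4) === 0) {
            $tokens[] = ['type' => 'comment', 'prefix' => '', 'name' => '', 'raw' => $raw];
            continue;
        }
        // Split "<", optional "/", optional "prefix:" and the local tag name.
        preg_match('~^<(/?)(?:([^:\s/>]+):)?([^\s/>]+)~', $raw, $m);
        $tokens[] = [
            'type'   => $m[1] === '/' ? 'endtag' : 'starttag',
            'prefix' => $m[2],
            'name'   => $m[3],
            'raw'    => $raw,
        ];
    }
    return $tokens;
}

// Usage: escape every tag that does not carry the "vin" prefix.
$input = '<vin:layout name="Page"> <header> {someText} <aas> '
       . '<vin:show section="Menu" /> </header> </vin:layout>';
foreach (scanTagSoup($input) as $token) {
    $isTag = $token['type'] === 'starttag' || $token['type'] === 'endtag';
    $keep  = !$isTag || $token['prefix'] === 'vin';
    echo $keep ? $token['raw'] : strtr($token['raw'], ['&' => '&amp;', '<' => '&lt;']);
}
echo PHP_EOL;
```

Because the scanner never checks that tags are balanced, stray tags such as `<aas>` simply become start-tag tokens instead of causing a parse error.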

Usage to change every tag (start or end) that does not have the namespace prefix could look like this ($string contains the data from your question):

    $scanner  = new TagsoupIterator($string);
    $nsPrefix = 'vin';
    foreach ($scanner as $node) {
        $isTag  = $node instanceof TagsoupTag;
        $isOfNs = $isTag && $node->getTagNsPrefix() === $nsPrefix;
        if ($isTag && !$isOfNs) {
            // Escape the tag so that an XML parser treats it as text.
            $node = strtr($node, ['&' => '&amp;', '<' => '&lt;']);
        }
        echo $node;
    }

Output:

    <vin:layout name="Page" xmlns:vin="http://www.example.com/vin">
        &lt;header>
            {someText}
            &lt;div>
                <!-- some invalid xml code -->
                &lt;aas>
                &lt;nav class="main">
                    <vin:show section="Menu" />
                &lt;/nav>
            &lt;/div>
        &lt;/header>
    </vin:layout>

Usage to extract everything inside a specific namespaced tag could look like this:

    $scanner = new TagsoupIterator($string);
    $parser  = new TagsoupForwardNavigator($scanner);

    // Build a predicate that matches a start tag with the given namespace prefix.
    $startTagWithNsPrefix = function ($namespace) {
        return function (TagsoupNode $node) use ($namespace) {
            /* @var $node TagsoupTag */
            return $node->getType() === Tagsoup::NODETYPE_STARTTAG
                && $node->getTagNsPrefix() === $namespace;
        };
    };

    $start = $parser->nextCondition($startTagWithNsPrefix('vin'));
    $tag   = $start->getTagName();
    $parser->next();
    echo $html = implode($parser->getUntilEndTag($tag));

Output:

    <header>
        {someText}
        <div>
            <!-- some invalid xml code -->
            <aas>
            <nav class="main">
                <vin:show section="Menu" />
            </nav>
        </div>
    </header>

The next step is to replace that part of $string. Since Tagsoup provides byte offsets and lengths, this is easy (and I take a shortcut via SimpleXML):

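    // Keep everything up to the end of the start tag and everything from the end tag on.
    $xml = substr($string, 0, $start->getEnd()) . substr($string, $parser->getOffset());
    $doc = new SimpleXMLElement($xml);
    $doc[0] = $html; // SimpleXML escapes the markup when it is assigned as text
    echo $doc->asXML();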

Output:

    <vin:layout xmlns:vin="http://www.example.com/vin" name="Page">
        &lt;header&gt;
            {someText}
            &lt;div&gt;
                &lt;!-- some invalid xml code --&gt;
                &lt;aas&gt;
                &lt;nav class="main"&gt;
                    &lt;vin:show section="Menu" /&gt;
                &lt;/nav&gt;
            &lt;/div&gt;
        &lt;/header&gt;
    </vin:layout>
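For the simple case where the namespaced element is the root of the whole string, the same offset-based splice can be sketched without any scanner. The function name `escapeRootContent` is hypothetical, and the sketch assumes that the first `>` closes the root start tag and that the last `</` opens the root end tag, which breaks if a root attribute value contains `>`.

```php
<?php
// Sketch only: the function name and the "first > / last </" assumption
// are illustrative and not part of the Tagsoup gist.
function escapeRootContent(string $string): string
{
    $startTagEnd = strpos($string, '>');   // byte offset of the end of the root start tag
    $endTagStart = strrpos($string, '</'); // byte offset of the root end tag
    if ($startTagEnd === false || $endTagStart === false || $endTagStart <= $startTagEnd) {
        throw new InvalidArgumentException('No enclosing root element found.');
    }
    $innerStart = $startTagEnd + 1;
    $inner = substr($string, $innerStart, $endTagStart - $innerStart);

    // Keep both root tags untouched and escape everything in between.
    return substr($string, 0, $innerStart)
        . htmlspecialchars($inner, ENT_NOQUOTES | ENT_XML1, 'UTF-8')
        . substr($string, $endTagStart);
}

$string = '<vin:layout name="Page" xmlns:vin="http://www.example.com/vin"> '
        . '<header> {someText} <aas> </header> </vin:layout>';
$xml = escapeRootContent($string);
echo $xml, PHP_EOL;

// The escaped string is well-formed, so an XML parser accepts it and
// hands the inner markup back as plain text.
if (extension_loaded('simplexml')) {
    $doc = new SimpleXMLElement($xml);
    echo (string) $doc, PHP_EOL;
}
```

This trades flexibility for brevity: it cannot pick out a namespaced element deeper in the document, which is what the scanner and navigator are for.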

Depending on your concrete needs, the implementation will need changes. For example, this version does not support nesting a tag inside another tag of the same name. It does not fail in that case, but it does not handle it correctly either. I have no idea whether you have that case; if you do, you need to add some kind of open/close counter. The navigator class can easily be extended for that, even to offer two different methods for finding the end tag.
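An open/close counter of that kind could look roughly like the sketch below. It works on the raw string rather than on the navigator class, and `findMatchingEndTag` is a hypothetical helper, not a method from the gist. It assumes attribute values contain no `<` or `>`.

```php
<?php
// Hypothetical helper: returns the byte offset of the end tag that closes
// the element whose start tag ends at $offset, or null if there is none.
function findMatchingEndTag(string $input, string $tagName, int $offset): ?int
{
    $name = preg_quote($tagName, '~');
    // Group 1 is "/" for end tags, group 2 is "/" for self-closing tags.
    $pattern = '~<(/?)' . $name . '(?=[\s/>])[^<>]*?(/?)>~';

    $depth = 1; // the start tag before $offset is already open
    while (preg_match($pattern, $input, $m, PREG_OFFSET_CAPTURE, $offset)) {
        [$raw, $pos] = $m[0];
        $offset = $pos + strlen($raw); // continue the search after this tag

        if ($m[1][0] === '/') {
            if (--$depth === 0) {
                return $pos;           // this end tag closes the outer element
            }
        } elseif ($m[2][0] !== '/') {
            $depth++;                  // nested start tag of the same name
        }
    }
    return null;
}

$html  = '<div>a<div>b</div>c</div>d';
$inner = strlen('<div>');              // offset just after the first start tag
$end   = findMatchingEndTag($html, 'div', $inner);
echo substr($html, $inner, $end - $inner), PHP_EOL;
```

The counter only tracks tags with the same name, so unbalanced tags of other names (like `<aas>` in the question) do not disturb it.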

The examples here use the Tagsoup classes, which you can find in this gist: https://gist.github.com/4415105
