How to get MATLAB to read the correct number of XML nodes

Question

I'm reading a simple XML file using MATLAB's built-in xmlread function.

    <root>
        <ref>
            <requestor>John Doe</requestor>
            <project>X</project>
        </ref>
    </root>

But when I call getChildNodes() on the ref element, it tells me that it has 5 children.

It works fine IF I put all the XML on ONE line. Then MATLAB tells me that the ref element has 2 children.

It doesn't seem to like the whitespace between elements.

Even if I run Canonicalize in the oXygen XML editor, I still get the same results, because Canonicalize still leaves the whitespace in place.

MATLAB uses Java and Xerces for its XML handling.

Question:

What can I do to keep my XML file in a readable format (not all on one line) and still have MATLAB parse it correctly?

Code Update:

    filename = 'example01.xml';
    docNode = xmlread(filename);
    rootNode = docNode.getDocumentElement;
    entries = rootNode.getChildNodes;
    nEnt = entries.getLength

+8

xml-parsing matlab

capdragon Jul 18 '12 at 19:09

2 answers

I felt @cholland's answer was good, but I didn't like the extra XML work. So here is a solution that strips the whitespace between elements from a copy of the XML file; that whitespace is the root cause of the unwanted nodes.

    fid = fopen('tmpCopy.xml','wt');
    str = regexprep(fileread(filename),'[\n\r]+',' ');
    str = regexprep(str,'>[\s]*<','><');
    fprintf(fid,'%s', str);
    fclose(fid);
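As a rough end-to-end check of this approach, the sketch below writes the indented sample from the question to a scratch file, strips the whitespace into a copy, and parses the copy. The scratch paths from `tempname` and the variable names `tmpCopy`, `refNode`, and `nChildren` are assumptions made for the demo, not part of the original answer.

```matlab
% Hypothetical input: the indented sample from the question in a scratch file.
filename = [tempname '.xml'];
fid = fopen(filename, 'wt');
fprintf(fid, '<root>\n    <ref>\n        <requestor>John Doe</requestor>\n');
fprintf(fid, '        <project>X</project>\n    </ref>\n</root>\n');
fclose(fid);

% Collapse line breaks, then drop whitespace that sits between two tags.
str = regexprep(fileread(filename), '[\n\r]+', ' ');
str = regexprep(str, '>\s*<', '><');

% Write the stripped text to a separate copy; the readable file is untouched.
tmpCopy = [tempname '.xml'];
fid = fopen(tmpCopy, 'wt');
fprintf(fid, '%s', str);
fclose(fid);

% Parse the copy and count the children of <ref>.
docNode = xmlread(tmpCopy);
refNode = docNode.getElementsByTagName('ref').item(0);
children = refNode.getChildNodes;
nChildren = children.getLength;

% Clean up the scratch files.
delete(filename);
delete(tmpCopy);
```

Because the regular expressions work on the raw text, they would also change whitespace inside CDATA sections or mixed content, so this is probably best kept for data-style XML like the example.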

+1

ldgorman Jun 19 '17 at 15:39

cholland · Accepted Answer · 2012-07-19T01:44:55+0000

Behind the scenes, the XML parser creates #text nodes for all whitespace between elements. Wherever there is a newline or an indent, it creates a #text node whose data part holds the newline and the indent spaces that follow it. So, for the XML example you gave, parsing the child nodes of the ref element returns 5 nodes:

Node 1: #text with a newline and indent spaces
Node 2: a "requestor" node, which in turn has a #text child with "John Doe" in the data part
Node 3: #text with a newline and indent spaces
Node 4: a "project" node, which in turn has a #text child with "X" in the data part
Node 5: #text with a newline and indent spaces
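To see this list for yourself, a minimal sketch along these lines prints every child of the ref element. It assumes the indented sample from the question, written here to a scratch file from `tempname`; the variable names are only illustrative.

```matlab
% Hypothetical input: the indented sample from the question in a scratch file.
xmlFile = [tempname '.xml'];
fid = fopen(xmlFile, 'wt');
fprintf(fid, '<root>\n    <ref>\n        <requestor>John Doe</requestor>\n');
fprintf(fid, '        <project>X</project>\n    </ref>\n</root>\n');
fclose(fid);

docNode = xmlread(xmlFile);
refNode = docNode.getElementsByTagName('ref').item(0);
children = refNode.getChildNodes;

% Print name and type of every child; whitespace-only text nodes are
% reported with the name #text, in between the element nodes.
for k = 0:children.getLength-1            % Java indexing is zero-based
    child = children.item(k);
    fprintf('Node %d: %s (type %d)\n', k+1, ...
        char(child.getNodeName), child.getNodeType);
end
nChildren = children.getLength;

delete(xmlFile);                          % remove the scratch file
```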

The function below removes all of these useless #text nodes for you. Please note: if you intentionally have an XML element that contains nothing but whitespace, this function will remove that whitespace too, but for 99.99% of XML use cases this should work fine.

    function removeIndentNodes( childNodes )
    numNodes = childNodes.getLength;
    remList = [];
    for i = numNodes:-1:1
        theChild = childNodes.item(i-1);
        if (theChild.hasChildNodes)
            removeIndentNodes(theChild.getChildNodes);
        else
            if ( theChild.getNodeType == theChild.TEXT_NODE && ...
                 ~isempty(char(theChild.getData())) && ...
                 all(isspace(char(theChild.getData()))))
                remList(end+1) = i-1; % java indexing
            end
        end
    end
    for i = 1:length(remList)
        childNodes.removeChild(childNodes.item(remList(i)));
    end
    end

Call it like this:

    tree = xmlread( xmlfile );
    removeIndentNodes( tree.getChildNodes );
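If you only need to count or visit the element children and would rather not modify the tree, a different technique (not the function above) is to skip everything that is not an element node. The sketch below assumes the indented sample from the question in a scratch file; `elementNames` and `nElements` are illustrative names.

```matlab
% Hypothetical input: the indented sample from the question in a scratch file.
xmlFile = [tempname '.xml'];
fid = fopen(xmlFile, 'wt');
fprintf(fid, '<root>\n    <ref>\n        <requestor>John Doe</requestor>\n');
fprintf(fid, '        <project>X</project>\n    </ref>\n</root>\n');
fclose(fid);

docNode = xmlread(xmlFile);
refNode = docNode.getElementsByTagName('ref').item(0);
children = refNode.getChildNodes;

% Keep only element nodes; whitespace #text nodes are skipped, not removed.
elementNames = {};
for k = 0:children.getLength-1            % Java indexing is zero-based
    child = children.item(k);
    if child.getNodeType == child.ELEMENT_NODE
        elementNames{end+1} = char(child.getNodeName); %#ok<SAGROW>
    end
end
nElements = numel(elementNames);

delete(xmlFile);                          % remove the scratch file
```

The trade-off is that every loop over child nodes needs the same type check, whereas removing the #text nodes once keeps later code simpler.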
