Node navigation using XPath in a flat structure

I have an XML file with a flat structure. We do not control the format of this file; we just have to deal with it. I have renamed the fields because the real ones are highly domain-specific and have no bearing on this problem.

    <attribute name="Title">Book A</attribute>
    <attribute name="Code">1</attribute>
    <attribute name="Author">
      <value>James Berry</value>
      <value>John Smith</value>
    </attribute>
    <attribute name="Title">Book B</attribute>
    <attribute name="Code">2</attribute>
    <attribute name="Title">Book C</attribute>
    <attribute name="Code">3</attribute>
    <attribute name="Author">
      <value>James Berry</value>
    </attribute>

Key points: the file is not particularly hierarchical. Each book starts at an attribute element with name='Title'. The attribute element with name='Author' is optional.

Is there a simple XPath expression that I can use to find the authors of book "n"? It is easy to identify the title of book "n", but the authors element is optional. And you cannot just take the next following author, because for book 2 that would return the author of book 3.

I wrote a state machine to parse this as a series of elements, but I can't help thinking there must be a more direct way to get the results I want.

+4
5 answers

We want the attribute element with @name 'Author' that follows the attribute element with @name 'Title' and value "Book n", with no other attribute element with @name 'Title' in between (because if there is one, the author belongs to some other book).

In other words, we want the author whose first preceding title (the one it "belongs to") is the one we are looking for:

    //attribute[@name='Author']
      [preceding-sibling::attribute[@name='Title'][1][contains(., 'Book N')]]

N = C => finds <attribute name="Author"><value>James Berry</value></attribute>

N = B => does not find anything
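For readers without an XPath engine that supports the preceding-sibling axis (the XPath subset in Python's standard xml.etree.ElementTree does not), the same "nearest preceding Title" rule can be sketched as a single pass over the siblings. This is only an illustration: the `<books>` wrapper element and the function name `authors_of` are assumptions, not part of the question.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, wrapped in a hypothetical <books> root.
XML = """<books>
  <attribute name="Title">Book A</attribute>
  <attribute name="Code">1</attribute>
  <attribute name="Author"><value>James Berry</value><value>John Smith</value></attribute>
  <attribute name="Title">Book B</attribute>
  <attribute name="Code">2</attribute>
  <attribute name="Title">Book C</attribute>
  <attribute name="Code">3</attribute>
  <attribute name="Author"><value>James Berry</value></attribute>
</books>"""


def authors_of(root, title_fragment):
    """Names from Author elements whose nearest preceding Title contains title_fragment."""
    names = []
    nearest_title = None  # text of the closest Title seen so far, in document order
    for node in root.findall("attribute"):
        kind = node.get("name")
        if kind == "Title":
            nearest_title = node.text or ""
        elif kind == "Author" and nearest_title is not None:
            # Mirrors [preceding-sibling::attribute[@name='Title'][1][contains(., ...)]]
            if title_fragment in nearest_title:
                names.extend(v.text for v in node.findall("value"))
    return names


root = ET.fromstring(XML)
print(authors_of(root, "Book C"))  # ['James Berry']
print(authors_of(root, "Book B"))  # []
```

Because the nearest title is overwritten every time a new Title is seen, the author of Book C is never attributed to Book B.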

Using keys and/or the grouping available in XSLT 2.0 would make this simpler (and much faster if the file is large).

(The SO parser seems to think that "//" means "comments", but that is not the case in XPath. Sigh.)

+3

Well, I used ElementTree to extract the data from the XML above. I saved the XML in a file called foo.xml (wrapped in a single root element, which fromstring needs in order to parse it).

    from xml.etree.ElementTree import fromstring


    def extract_data():
        """Return a dict mapping each book title to its list of authors."""
        # Read-only access is enough; the with block closes the file.
        with open('foo.xml', 'r') as f:
            elem = fromstring(f.read())

        dic = {}
        key = None
        for attribute in elem.findall('attribute'):
            if attribute.attrib['name'] == 'Title':
                # Remember the current book; a later Author element belongs to it.
                key = attribute.text
            elif attribute.attrib['name'] == 'Author' and key is not None:
                dic[key] = [v.text for v in attribute.findall('value')]
        return dic

When you run this function, you will get the following:

 {'Book A': ['James Berry', 'John Smith'], 'Book C': ['James Berry']} 

Hope this is what you are looking for. If not, just specify a little more. :)

+2

As bambax noted in his answer, a solution using XSLT keys is more efficient, especially for large XML documents:

    <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output omit-xml-declaration="yes"/>
      <!-- -->
      <xsl:key name="kAuthByTitle"
       match="attribute[@name='Author']"
       use="preceding-sibling::attribute[@name='Title'][1]"/>
      <!-- -->
      <xsl:template match="/">
        Book C Author:
        <xsl:copy-of select="key('kAuthByTitle', 'Book C')"/>
        <!-- -->
        ====================
        Book B Author:
        <xsl:copy-of select="key('kAuthByTitle', 'Book B')"/>
      </xsl:template>
    </xsl:stylesheet>

When the above transformation is applied to this XML document:

    <t>
      <attribute name="Title">Book A</attribute>
      <attribute name="Code">1</attribute>
      <attribute name="Author">
        <value>James Berry</value>
        <value>John Smith</value>
      </attribute>
      <attribute name="Title">Book B</attribute>
      <attribute name="Code">2</attribute>
      <attribute name="Title">Book C</attribute>
      <attribute name="Code">3</attribute>
      <attribute name="Author">
        <value>James Berry</value>
      </attribute>
    </t>

the correct result is produced:

    Book C Author:
    <attribute name="Author">
      <value>James Berry</value>
    </attribute>
    ====================
    Book B Author:

Note that the XPath "//" abbreviation should be avoided as much as possible, as it usually causes the whole XML document to be scanned on every evaluation of the XPath expression.
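The idea behind the key, building the title-to-author mapping once and then looking titles up, can also be sketched outside XSLT. The following is a rough Python analogue, not an equivalent of xsl:key; the function name `build_author_index` is hypothetical, and the `<t>` root is the wrapper used in the document above.

```python
import xml.etree.ElementTree as ET

XML = """<t>
  <attribute name="Title">Book A</attribute>
  <attribute name="Code">1</attribute>
  <attribute name="Author"><value>James Berry</value><value>John Smith</value></attribute>
  <attribute name="Title">Book B</attribute>
  <attribute name="Code">2</attribute>
  <attribute name="Title">Book C</attribute>
  <attribute name="Code">3</attribute>
  <attribute name="Author"><value>James Berry</value></attribute>
</t>"""


def build_author_index(root):
    """One pass over the siblings: map every title to the author names that follow it."""
    index = {}
    current_title = None
    for node in root.findall("attribute"):
        kind = node.get("name")
        if kind == "Title":
            current_title = node.text
            # Register the book even if no Author element follows it.
            index.setdefault(current_title, [])
        elif kind == "Author" and current_title is not None:
            index[current_title].extend(v.text for v in node.findall("value"))
    return index


index = build_author_index(ET.fromstring(XML))
print(index["Book C"])  # ['James Berry']
print(index["Book B"])  # []
```

After the single pass, each lookup is a dictionary access rather than a fresh scan of the document, which is the same trade the key makes.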

+1

I'm not sure you really want to go there: the simplest thing I found was to start from the author, get the preceding title, and then check that the first author or title following that title is indeed an author. Nasty!

    /books/attribute[@name="Author"]
      [preceding-sibling::attribute[@name="Title" and string()="Book B"]
        [following-sibling::attribute[@name="Author" or @name="Title"]
          [1]
          [@name="Author"]
        ]
      ][1]

(I added the books tag to wrap the file.)

(I tested this with libxml2, using xml_grep2, but only on the sample data you gave, so more tests are welcome.)
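The check in that expression, "the first Author-or-Title after the title must be an Author", reads more easily as a forward scan starting from the title. Here is a minimal Python sketch of it, assuming the same books wrapper; the function name `first_author_after` is made up for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<books>
  <attribute name="Title">Book A</attribute>
  <attribute name="Code">1</attribute>
  <attribute name="Author"><value>James Berry</value><value>John Smith</value></attribute>
  <attribute name="Title">Book B</attribute>
  <attribute name="Code">2</attribute>
  <attribute name="Title">Book C</attribute>
  <attribute name="Code">3</attribute>
  <attribute name="Author"><value>James Berry</value></attribute>
</books>"""


def first_author_after(root, title):
    """Return the Author element belonging to the given title, or None if it has none."""
    siblings = list(root)
    for i, node in enumerate(siblings):
        if node.get("name") == "Title" and node.text == title:
            for following in siblings[i + 1:]:
                if following.get("name") == "Title":
                    return None  # the next book starts before any author appears
                if following.get("name") == "Author":
                    return following
            return None  # end of file reached without finding an author
    return None  # title not present at all


root = ET.fromstring(XML)
author = first_author_after(root, "Book A")
print([v.text for v in author.findall("value")])  # ['James Berry', 'John Smith']
print(first_author_after(root, "Book B"))         # None
```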

0

Select all the titles and apply templates to them.

    <xsl:template match="/">
      <xsl:apply-templates select="//attribute[@name='Title']"/>
    </xsl:template>

In the template, output the title, then check whether a following title exists. If there is none, output the next author. If there is one, check whether the author that follows the next book's title is the same as the author that follows the current title. If it is, the current book has no author:

    <xsl:template match="*">
      <book>
        <title><xsl:value-of select="."/></title>
        <author>
          <xsl:if test="not(following::attribute[@name='Title']) or following::attribute[@name='Author'] != following::attribute[@name='Title']/following::attribute[@name='Author']">
            <xsl:value-of select="following::attribute[@name='Author']"/>
          </xsl:if>
        </author>
      </book>
    </xsl:template>
0
