How to find an xml node that does not have an attribute

I am using Python 2.7 and trying to parse the XML below. I want to build one Python list of all genres that have a language attribute and another list of the genres that do not.

I am importing the parser with import xml.etree.cElementTree as ET

I know that I can find the section whose language attribute is "fr" with:

 tree = ET.ElementTree(file='popups.xml')
 root = tree.getroot()
 for x in root.findall('alt[@{http://www.w3.org/XML/1998/namespace}lang="fr"]/alt'):
     print x.text

I don't really understand why I have to write {http://www.w3.org/XML/1998/namespace}lang instead of xml:lang, but the above seems to work on Ubuntu 12.04.

What I'm trying to figure out is the "not" syntax, to match the XML section that has no language attribute.

Does anyone have any thoughts on how to achieve this?

 <genre>
   <alt>
     <alt genre="easy listening">lounge</alt>
     <alt genre="alternative">ska</alt>
   </alt>
   <alt xml:lang="fr">
     <alt genre="gospel">catholique</alt>
   </alt>
 </genre>
2 answers

You need to use the full QName in your xpath because stdlib ElementTree has no way to register the prefix. I usually use a helper function to create QNames:

 def qname(prefix, element, map={'xml':'http://www.w3.org/XML/1998/namespace'}):
     return "{{{}}}{}".format(map[prefix], element)
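As a rough illustration of how the helper plugs into findall, the sketch below rebuilds the question's "fr" query from it. The inlined XML string and the variable names path and french are assumptions made for the example, and xml.etree.ElementTree is used in place of cElementTree so the snippet also runs on newer Python versions.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, inlined so the sketch is self-contained.
XML = """<genre>
  <alt>
    <alt genre="easy listening">lounge</alt>
    <alt genre="alternative">ska</alt>
  </alt>
  <alt xml:lang="fr">
    <alt genre="gospel">catholique</alt>
  </alt>
</genre>"""

def qname(prefix, element, map={'xml': 'http://www.w3.org/XML/1998/namespace'}):
    # Build the "{namespace-uri}localname" form that ElementTree expects.
    return "{{{}}}{}".format(map[prefix], element)

root = ET.fromstring(XML)

# Equivalent to 'alt[@{http://www.w3.org/XML/1998/namespace}lang="fr"]/alt'
path = 'alt[@{}="fr"]/alt'.format(qname('xml', 'lang'))
french = [x.text for x in root.findall(path)]
print(french)
```

The helper only saves typing; the path handed to findall is the same string the question already uses.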

The ElementTree implementation in the standard library does not support enough of XPath to do what you want easily. However, the spec for xml:lang says that the value of this attribute is inherited by everything the element contains, much like xml:base or xmlns namespace declarations. Thus, we can make the language setting explicit on every element:

 xml_lang = qname('xml', 'lang')

 def set_xml_lang(root, defaultlang=''):
     xml_lang = qname('xml', 'lang')
     for item in root:
         try:
             lang = item.attrib[xml_lang]
         except KeyError:
             item.set(xml_lang, defaultlang)
             lang = defaultlang
         set_xml_lang(item, lang)

 set_xml_lang(root)

 namespaces = {'xml':'http://www.w3.org/XML/1998/namespace'}
 # Every element in root now has an xml:lang attribute
 # so XPath is easy now:
 alts_with_no_lang = root.findall('alt[@{{{xml}}}lang=""]'.format(**namespaces))
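If modifying the tree is not desirable, one possible alternative is to check each element's attributes directly in Python rather than in the path expression. This is only a sketch under assumptions: the XML is inlined from the question, and XML_LANG, with_lang and without_lang are names chosen for the example. It looks only at the top-level alt groups, as the question does, and ignores inheritance.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree stores it.
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'

XML = """<genre>
  <alt>
    <alt genre="easy listening">lounge</alt>
    <alt genre="alternative">ska</alt>
  </alt>
  <alt xml:lang="fr">
    <alt genre="gospel">catholique</alt>
  </alt>
</genre>"""

root = ET.fromstring(XML)

with_lang, without_lang = [], []
for group in root.findall('alt'):
    # Pick the target list depending on whether xml:lang is present.
    bucket = with_lang if XML_LANG in group.attrib else without_lang
    bucket.extend(child.text for child in group.findall('alt'))

print(with_lang)
print(without_lang)
```

This produces both lists the question asks for in a single pass and leaves the parsed tree untouched.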

If you are willing to use lxml, your handling of languages can be much more robust because it implements the full XPath 1.0 spec. In particular, you can use the lang() function:

 import lxml.etree as ET
 root = ET.fromstring(xml)
 print root.xpath('//alt[lang("fr")]')

As a bonus, it has the correct lang() semantics, such as case insensitivity and sublanguage matching (for example, lang('en') will be true for xml:lang="en-US").

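Unfortunately, you cannot use lang() to determine the language of a node. You need to find the nearest xml:lang attribute on the node itself or on its ancestors and use that. A parenthesized node-set is indexed in document order, so the nearest attribute is the last one:

 mylang = node.xpath('(ancestor-or-self::*/@xml:lang)[last()]')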

Putting it all together to match nodes that don't have a language:

 root.xpath('//alt[not((ancestor-or-self::*/@xml:lang)[1])]')
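For readers who cannot install lxml, a similar selection can be approximated with the standard library by walking the tree and carrying the inherited language along. The generator below is a hypothetical helper written for this sketch, not part of any library; the inlined XML and the names iter_with_lang, no_lang and texts are likewise assumptions.

```python
import xml.etree.ElementTree as ET

XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'

XML = """<genre>
  <alt>
    <alt genre="easy listening">lounge</alt>
    <alt genre="alternative">ska</alt>
  </alt>
  <alt xml:lang="fr">
    <alt genre="gospel">catholique</alt>
  </alt>
</genre>"""

def iter_with_lang(elem, inherited=None):
    """Yield (element, effective language) for elem and all its descendants."""
    # An element's own xml:lang wins; otherwise it inherits from its parent.
    lang = elem.get(XML_LANG, inherited)
    yield elem, lang
    for child in elem:
        for pair in iter_with_lang(child, lang):
            yield pair

root = ET.fromstring(XML)

# alt elements with no xml:lang on themselves or on any ancestor
no_lang = [e for e, lang in iter_with_lang(root)
           if e.tag == 'alt' and lang is None]

# Keep only the elements that carry real text (skips the wrapper alt).
texts = [e.text.strip() for e in no_lang if e.text and e.text.strip()]
print(texts)
```

Like the XPath expression, this also matches the outer alt wrapper that has no language in scope; the final list comprehension filters it out because it contains only whitespace.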

"I don't really understand why I have to write {http://www.w3.org/XML/1998/namespace}lang instead of xml:lang, but the above seems to work on Ubuntu 12.04"

What you are trying to do is simpler with lxml's xpath method (which cElementTree does not provide). Among other things, it resolves the reserved xml prefix without an explicit namespace mapping (other prefixes have to be passed in through the namespaces argument), so you can write:

 import lxml.etree as et
 root = et.parse(open('mydoc.xml')).getroot()
 for x in root.xpath('alt[not(@xml:lang)]/alt'):
     print x.text

The syntax is not(@attr), which I did not know about before; a Google search on selecting an element without an attribute in XPath turned it up.
