lxml element tag names with namespaces

I am trying to use lxml to parse the contents of a .docx document. I understand that lxml replaces namespace prefixes with the actual namespace URI, but this makes it a real pain to check which element tag I am working with. I would like to be able to do something like

if (someElement.tag == "w:p"): 

but since lxml insists on adding a full namespace, I need to either do something like

    if (someElement.tag == "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p"):

or look up the full namespace URI in the element's nsmap attribute, like this

    targetTag = "{%s}p" % someElement.nsmap['w']
    if (someElement.tag == targetTag):

Is there an easier way to convince lxml to either

  • give me the tag string without the namespace it adds (I could then use the prefix attribute along with it to check which tag I'm working with), or
  • just give me the tag string using the prefix?

This would save a lot of keystrokes when writing this parser. Is it possible? Am I missing something in the documentation?

5 answers

You could use the XPath local-name() function:

    import lxml.etree as ET

    tree = ET.fromstring('<root xmlns:f="foo"><f:test/></root>')
    elt = tree[0]
    print(elt.xpath('local-name()'))  # test

I could not find a way to get the non-namespaced tag name from an element: lxml considers the namespace to be part of the tag name. Here are some options that may help.

You can also use the QName class to build a namespaced tag for comparison:

    import lxml.etree
    from lxml.etree import QName

    tree = lxml.etree.fromstring('<root xmlns:f="foo"><f:test/></root>')
    qn = QName(tree.nsmap['f'], 'test')
    assert tree[0].tag == qn

If you need a bare tag name, you need to write a utility function to extract it:

    def get_bare_tag(elem):
        return elem.tag.rsplit('}', 1)[-1]

    assert get_bare_tag(tree[0]) == 'test'

Unfortunately, as far as I know, you cannot search for tags in "any namespace" (for example, {*}test) using the lxml xpath/find methods.
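One possible workaround, borrowing the local-name() function from the other answer, is to put it in an XPath predicate so that the namespace is ignored during the search. This is only a sketch: the sample document and the urn:foo / urn:bar namespace URIs are made up for illustration.

```python
from lxml import etree

# Made-up document: the same local name "test" appears in two namespaces.
root = etree.fromstring(
    '<root xmlns:f="urn:foo" xmlns:g="urn:bar">'
    '<f:test/><g:test/><g:other/>'
    '</root>'
)

# local-name() ignores the namespace, so both <f:test/> and <g:test/> match.
matches = root.xpath('//*[local-name() = "test"]')
tags = [el.tag for el in matches]
print(tags)
```

The matched elements still carry their fully qualified tags, so the namespace can be checked afterwards if it matters.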

Update: note that lxml will not create a tag containing an unbalanced { or }. Trying to do so raises ValueError: Invalid tag name, so we can safely assume that any tag name starting with { has balanced braces.

    >>> lxml.etree.Element('{foo')
    ValueError: Invalid tag name

etree.QName should be able to get you what you want.

    from lxml import etree

    # [...]
    tag = etree.QName(someElement)
    print(tag.namespace, tag.localname)

For the tag in your example, this prints:

 http://schemas.openxmlformats.org/wordprocessingml/2006/main p 

Note that QName will accept either an Element object or a string (for example, from Element.tag).

And, as you noticed, you can also use Element.nsmap to map an arbitrary prefix to a namespace.

So something like this:

 if tag.namespace == someElement.nsmap["w"] and tag.localname == "p": 
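Put together, the QName check might look like the sketch below. The namespace URI is the one from the question; the is_tag helper and the tiny sample document are hypothetical and only there to make the example runnable.

```python
from lxml import etree

# Namespace URI taken from the question; the XML snippet itself is made up.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

doc = etree.fromstring(
    '<w:document xmlns:w="%s">'
    '<w:body><w:p><w:r><w:t>Hi</w:t></w:r></w:p></w:body>'
    '</w:document>' % W_NS
)

def is_tag(elem, namespace, localname):
    """Hypothetical helper: True if elem has this namespace URI and local name."""
    q = etree.QName(elem)
    return q.namespace == namespace and q.localname == localname

# Restrict iteration to real elements (skips comments and processing instructions).
paragraphs = [el for el in doc.iter(etree.Element) if is_tag(el, W_NS, "p")]
print(len(paragraphs))
```

Comparing against a constant such as W_NS avoids depending on whichever prefix the document happens to use for that namespace.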

To save time when looking up high-volume tags, such as p (paragraph, I suppose) in docx or c (cell) in xlsx, it is usually worth building the full tag once at module or class level:

    WPML_URI = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
    tag_p = WPML_URI + 'p'
    tag_t = WPML_URI + 't'

I have never seen an explanation of why you would need to use QName().
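As a rough illustration of how such precomputed tags might be used, the sketch below passes them to Element.iter() to collect the text of each paragraph. The sample XML is made up and much simpler than a real docx body.

```python
from lxml import etree

WPML_URI = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
tag_p = WPML_URI + "p"
tag_t = WPML_URI + "t"

# Made-up fragment standing in for the body of a docx document.
xml = (
    '<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:p><w:r><w:t>one</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>two</w:t></w:r></w:p>'
    '</w:body>'
)
body = etree.fromstring(xml)

texts = []
for p in body.iter(tag_p):
    # Join the text of every <w:t> run inside this paragraph.
    texts.append("".join(t.text or "" for t in p.iter(tag_t)))
print(texts)
```

Because the full tag strings are built only once, the loop does plain string comparisons and no per-element namespace lookups.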

In the other direction, given the full tag, you can easily extract the base tag:

base_tag = full_tag.rsplit("}", 1)[-1]


I am not a Python expert, but I also had this problem (with Windows 7 Contacts files). I wrote the following function for use with lxml.

It takes an element and returns its tag with the namespace URI replaced by the matching prefix from the element's nsmap.

    from lxml import etree

    def denstag(ee):
        tag = ee.tag
        for ns in ee.nsmap:
            prefix = "{" + ee.nsmap[ns] + "}"
            if tag.startswith(prefix):
                local = tag[len(prefix):]
                # The default namespace is stored under the key None.
                return local if ns is None else ns + ":" + local
        return tag
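The same idea can also be written with etree.QName from the earlier answer. This is only a sketch: prefixed_tag is a hypothetical name, the sample document is made up, and elements in the default namespace (which nsmap stores under the key None) are returned without a prefix.

```python
from lxml import etree

def prefixed_tag(elem):
    """Hypothetical helper: 'prefix:local' when a prefix is in scope, else the local name."""
    q = etree.QName(elem)
    if q.namespace is None:
        return q.localname
    for prefix, uri in elem.nsmap.items():
        # Skip the default namespace, which has no prefix to print.
        if uri == q.namespace and prefix is not None:
            return prefix + ":" + q.localname
    return q.localname

# Made-up document with a default namespace and one prefixed namespace.
root = etree.fromstring(
    '<root xmlns="urn:default" xmlns:f="urn:foo"><f:test/><plain/></root>'
)
names = [prefixed_tag(el) for el in root.iter(etree.Element)]
print(names)
```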