XML parsing with undeclared prefixes in Python

I am trying to parse XML data with Python. The data uses prefixes, but not every file declares them. Example XML:

<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
    <thing>Word</thing>
    <abc:thing2>Another Word</abc:thing2>
</item>

I use xml.etree.ElementTree to parse these files, but whenever a prefix is not declared properly, ElementTree raises a parse error (unbound prefix, at the start of <abc:thing2>). Searching for this error leads me to solutions that suggest fixing the namespace declaration. However, I do not control the XML I need to work with, so modifying the input files is not a viable option.
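A minimal reproduction of the error, using only the standard library. The names SAMPLE and try_parse are illustrative and not part of the original code:

```python
import xml.etree.ElementTree as ET

# The example document from above; the "abc" prefix is never declared.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
    <thing>Word</thing>
    <abc:thing2>Another Word</abc:thing2>
</item>"""


def try_parse(text):
    """Return the root element, or the ParseError if parsing fails."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as err:
        return err


result = try_parse(SAMPLE)
# For this input the result is a ParseError mentioning "unbound prefix".
print(type(result).__name__, result)
```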

Searching for namespace parsing in general leads to many questions about searching in a namespace-agnostic way, which is not what I need.

I am looking for a way to automatically parse these files, even if the namespace declaration is broken. I thought about doing the following:

  • tell ElementTree which namespaces to expect in advance, because I know which ones can occur. I found register_namespace, but it does not seem to work.
  • have the full DTD read in before parsing and see if that solves it. I could not find a way to do this with ElementTree.
  • tell ElementTree not to bother about namespaces at all. This should not cause problems with my data, but I did not find a way to do this.
  • use some other parsing library that can handle this problem, although I would prefer not to install additional libraries. I find it hard to tell from the documentation whether any of the others would solve my problem.
  • some other route that I am not seeing?

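UPDATE: After Har07 put me on the path of lxml, I tried to see whether it would let me carry out the different solutions I had thought of, and what the result would be:

  • telling the parser which namespaces to expect in advance: I still could not find any "official" way to do this, but in my earlier searches I had found the suggestion to simply add the required declaration to the data programmatically. (That was for a different programming case, and unfortunately I cannot find the link again.) It seemed terribly hacky to me, but I have now tried it anyway. It involves loading the data as a string, changing the enclosing element so that it carries the right xmlns declarations, and then handing the result to lxml.etree's fromstring. Unfortunately, it also requires removing the encoding declaration from the string. It works, though.
  • reading the DTD before parsing: this is possible with lxml (through attribute_defaults, dtd_validation or load_dtd), but unfortunately it does not solve the namespace issue.
  • telling lxml not to bother about namespaces: possible through recover. However, that also ignores other ways in which the XML may be broken that I would prefer to know about (see Har07's answer).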
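One way to tell the parser which namespaces to expect is to patch the missing xmlns declarations into the text before parsing. The sketch below does this with the standard library only. The prefix-to-URI mapping, the URI itself and the function name are hypothetical, and the check for an existing declaration is a crude substring test, so treat it as an illustration rather than a robust implementation:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping of prefixes that may show up undeclared.
KNOWN_NAMESPACES = {"abc": "http://example.com/abc"}

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
    <thing>Word</thing>
    <abc:thing2>Another Word</abc:thing2>
</item>"""


def parse_with_known_prefixes(text, namespaces=None):
    """Declare missing prefixes on the root element, then parse."""
    namespaces = KNOWN_NAMESPACES if namespaces is None else namespaces
    # Drop the XML declaration. The standard library does not need this,
    # but it keeps the patched text usable as a plain fragment.
    text = re.sub(r"^\s*<\?xml[^>]*\?>", "", text)
    # Locate the root start tag: the first '<' followed by a name character,
    # which skips comments and processing instructions.
    match = re.search(r"<[A-Za-z_][\w.\-:]*", text)
    if match is None:
        raise ValueError("no root element found")
    # Assumption: a prefix counts as declared if "xmlns:<prefix>=" appears
    # anywhere in the text.
    missing = "".join(
        ' xmlns:%s="%s"' % (prefix, uri)
        for prefix, uri in namespaces.items()
        if "xmlns:%s=" % prefix not in text
    )
    patched = text[:match.end()] + missing + text[match.end():]
    return ET.fromstring(patched)


root = parse_with_known_prefixes(SAMPLE)
print(root.find("{http://example.com/abc}thing2").text)
```

Because the prefix is now bound, the element is addressed by its namespace URI in ElementTree's {uri}name form rather than by the prefix.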

One possible way to get rid of ElementTree's error is to switch to lxml. For example:

from lxml import etree as ElementTree

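# Bytes literal: lxml has historically rejected str input that carries
# an encoding declaration, so pass bytes to stay safe on Python 3.
xml = b"""<?xml version="1.0" encoding="UTF-8"?>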
<item subtype="bla">
    <thing>Word</thing>
    <abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)

thing = tree.xpath("//thing")[0]
print(ElementTree.tostring(thing))

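As you can see, broken XML can be parsed by lxml using the recover=True option of XMLParser. lxml also fully supports XPath 1.0, which is useful when you need to query part of the XML with more complex criteria.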

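UPDATE:

I don't know all the types of XML error that recover=True can tolerate. But here is another one that I know of, besides the unbound namespace prefix: an unclosed tag. lxml fixes, rather than ignores, an unclosed tag by automatically adding the corresponding closing tag. For example, given the following broken XML: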

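from lxml import etree as ElementTree

# <bad> is never closed, and the abc prefix is never declared.
xml = """<item subtype="bla">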
    <thing>Word</thing>
    <bad>
    <abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)

print(ElementTree.tostring(tree))

The final XML after parsing by lxml is as follows:

<item subtype="bla">
    <thing>Word</thing>
    <bad>
    <abc:thing2>Another Word</abc:thing2>
</bad></item>
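Since recover=True repairs problems silently, it can help to check afterwards what the parser had to tolerate. lxml parsers expose an error_log for the last parser run. The sketch below collects its entries. The helper name parse_and_report and the choice of fields to report are illustrative assumptions:

```python
from lxml import etree

# Same broken document as above: an unclosed <bad> and an undeclared prefix.
BROKEN = b"""<item subtype="bla">
    <thing>Word</thing>
    <bad>
    <abc:thing2>Another Word</abc:thing2>
</item>"""


def parse_and_report(data):
    """Parse leniently and return (root, [(line, message), ...])."""
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(data, parser)
    # error_log holds the errors and warnings of the last parser run.
    problems = [(entry.line, entry.message) for entry in parser.error_log]
    return root, problems


root, problems = parse_and_report(BROKEN)
for line, message in problems:
    print("line %d: %s" % (line, message))
```

A caller could accept the document when the only entries concern undefined namespace prefixes and reject it otherwise, which keeps the lenient parsing without hiding other damage.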