I am trying to parse XML data with Python. The XML uses prefixes, but not every file declares them. Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>
I use xml.etree.ElementTree to parse these files, but whenever the prefix is not declared properly, ElementTree raises a parse error (unbound prefix, right at the start of <abc:thing2>). Searching for this error leads me to solutions that suggest fixing the namespace declaration. However, I do not control the XML I need to work with, so modifying the input files is not a viable option.
Searching for namespace parsing in general leads to many questions about searching in a namespace-agnostic way, which is not what I need.
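For reference, the failure can be reproduced with the example above. This is only a minimal sketch; the name SAMPLE is introduced here for illustration and holds the example document as a string.

```python
import xml.etree.ElementTree as ET

# The example document from the question; the abc prefix is never declared.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""

try:
    ET.fromstring(SAMPLE)
    error_message = None
except ET.ParseError as exc:
    # expat rejects the document as soon as it meets the undeclared prefix
    error_message = str(exc)
    print(error_message)
```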
I am looking for a way to automatically parse these files, even if the namespace declaration is broken. I thought about doing the following:
- tell ElementTree which namespaces to expect in advance, because I know which ones can occur. I found register_namespace, but it does not seem to work.
- read the full DTD before parsing and see if that solves it. I could not find a way to do this with ElementTree.
- tell ElementTree not to worry about namespaces at all. This should not cause problems with my data, but I did not find a way to do this.
- use some other parsing library that can handle this problem, although I would prefer not to install additional libraries. I find it hard to tell from the documentation whether any of them can solve my problem.
- some other way that I am not seeing?
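The first option in the list can be approximated with the standard library alone. register_namespace only controls which prefix is written when a tree is serialized, so it has no effect at parse time. One workaround is to add the missing declarations to the root start tag before parsing. The sketch below assumes the possible prefixes are known in advance. KNOWN_NAMESPACES, its placeholder URI, and parse_with_known_prefixes are hypothetical names made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping of prefixes that may occur to placeholder URIs.
KNOWN_NAMESPACES = {"abc": "http://example.com/abc"}

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""


def parse_with_known_prefixes(xml_text, namespaces=KNOWN_NAMESPACES):
    """Declare any missing known prefixes on the root element, then parse."""
    # First start tag that is not the XML declaration, a comment or a DOCTYPE.
    # Crude: a comment before the root that contains "<name" would fool it.
    match = re.search(r"<[A-Za-z_][\w.\-:]*", xml_text)
    if match is None:
        raise ValueError("no root element found")
    # Only add declarations that do not already appear somewhere in the text.
    missing = "".join(
        f' xmlns:{prefix}="{uri}"'
        for prefix, uri in namespaces.items()
        if f"xmlns:{prefix}=" not in xml_text
    )
    patched = xml_text[:match.end()] + missing + xml_text[match.end():]
    return ET.fromstring(patched)


root = parse_with_known_prefixes(SAMPLE)
for child in root:
    print(child.tag, child.text)
```

The input files stay untouched because only the in-memory string is patched. Elements with a patched prefix come back with namespace-qualified tags of the form {uri}name, so later lookups have to use the same placeholder URI.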
UPDATE:
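After Har07 pointed me to lxml, I checked whether it would let me carry out the options I had considered, and with what result:
- tell the parser which namespaces to expect in advance: I still found no "official" way to do this, but the required declaration can be added to the data programmatically. (I had seen this suggested for a different situation, but I can no longer find the link.) It feels hacky: load the data as a string, add the right xmlns declarations to the enclosing element, then hand the result to lxml.etree's fromstring. The encoding declaration also has to be removed from the string first. It works, though.
- read the DTD before parsing: lxml can do this (through attribute_defaults, dtd_validation or load_dtd), but it does not solve the namespace problem.
- tell lxml not to worry about namespaces: possible through recover. It also recovers from badly formed XML, so this is the option I chose (see Har07's answer).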
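The recover option can be sketched roughly as follows. lxml is a third-party package, so the import is guarded here. parse_leniently and SAMPLE are names introduced for this illustration only. The data is passed as bytes, because lxml refuses str input that carries an encoding declaration.

```python
try:
    from lxml import etree  # third-party package, not in the standard library
except ImportError:
    etree = None

# The example document from the question, as bytes.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""


def parse_leniently(data):
    """Parse XML bytes with a parser that tries to recover from errors."""
    parser = etree.XMLParser(recover=True)
    return etree.fromstring(data, parser)


if etree is not None:
    root = parse_leniently(SAMPLE)
    for child in root:
        print(child.tag, child.text)
```

Note that recover makes the parser tolerant of other defects too, so a badly damaged file may be parsed only partially without an exception being raised.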