I use the Python lxml library to convert XML files to a new schema, but I have run into a problem handling processing instructions in the XML body.
The processing instructions are scattered throughout the XML, as in the following example (they all start with "oasys" and end with a unique code):
string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"
I cannot find them with the element's findall() method, although getchildren() does return them:
tree = lxml.etree.fromstring(string)
print tree.findall(".//")
>>>> [<Element i at 0x747c>]
print tree.getchildren()
>>>> [<?oasys _dc21-?>, <Element i at 0x747x>]
print tree.getchildren()[0].tag
>>>> <built-in function ProcessingInstruction>
print tree.getchildren()[0].tail
>>>> Text
Is there an alternative to getchildren() for finding and deleting processing instructions, given that they are nested at different levels of the XML?
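For reference, here is a minimal sketch of one direction that might work. It is written for Python 3 and relies on lxml's iter() accepting the etree.ProcessingInstruction factory as a tag filter. The helper name strip_processing_instructions and its target parameter are made up for illustration. Because lxml's remove() also drops the removed node's tail text, the sketch first moves the tail onto the previous sibling or the parent.

```python
from lxml import etree


def strip_processing_instructions(root, target=None):
    """Hypothetical helper: remove processing instructions below `root`.

    If `target` is given, only PIs with that target (e.g. "oasys") are
    removed. The text that follows each PI (its tail) is preserved.
    """
    # Build the list first so the tree is not modified while iterating.
    for pi in list(root.iter(etree.ProcessingInstruction)):
        if target is not None and pi.target != target:
            continue
        parent = pi.getparent()
        tail = pi.tail or ""
        previous = pi.getprevious()
        if previous is not None:
            # Attach the tail to the preceding sibling node.
            previous.tail = (previous.tail or "") + tail
        else:
            # The PI is the first child, so the tail joins the parent's text.
            parent.text = (parent.text or "") + tail
        parent.remove(pi)
    return root


string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"
tree = etree.fromstring(string)
strip_processing_instructions(tree, target="oasys")
print(etree.tostring(tree).decode())
# prints: <text>Text <i>contents</i></text>
```

An XPath query such as tree.xpath('//processing-instruction("oasys")') should select the same nodes at any depth, if that is preferred over iter().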