Removing Processing Instructions Using Python lxml

I use the Python lxml library to convert XML files to a new schema, but I ran into problems when handling processing instructions in the XML body.

Processing instructions are scattered throughout the XML, as in the following example (they all start with "oasys" and end with a unique code):

string = "<text><?oasys _dc21-?>Text <i>contents</i></text>" 

I cannot find them with the element's findall() method, although getchildren() returns them:

    import lxml.etree

    tree = lxml.etree.fromstring(string)
    print(tree.findall(".//*"))
    # [<Element i at 0x747c>]
    print(tree.getchildren())
    # [<?oasys _dc21-?>, <Element i at 0x747x>]
    print(tree.getchildren()[0].tag)
    # <built-in function ProcessingInstruction>
    print(tree.getchildren()[0].tail)
    # Text

Is there an alternative to getchildren() for finding and deleting processing instructions, especially considering that they are nested at different levels of the XML?

1 answer

You can use the processing-instruction() XPath node test to find the processing instructions, then remove them with etree.strip_tags(), which drops the nodes but keeps their tail text.

Example:

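    from lxml import etree

    string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"
    tree = etree.fromstring(string)

    # Select every processing instruction in the document
    pis = tree.xpath("//processing-instruction()")
    for pi in pis:
        # pi.tag is etree.ProcessingInstruction, so this strips the
        # processing instructions under the parent and keeps their tail text
        etree.strip_tags(pi.getparent(), pi.tag)

    print(etree.tostring(tree, encoding="unicode"))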

Output:

 <text>Text <i>contents</i></text> 
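
As a possible shortcut, strip_tags() also accepts etree.ProcessingInstruction directly, so the XPath lookup and the loop can be collapsed into one call on the root. The sketch below shows that, plus a variant that removes only instructions with a given target (here "oasys"). The helper names strip_all_pis and strip_pis_by_target and the extra <?keep me?> instruction in the sample are made up for illustration; they are not part of lxml or of the original question.

```python
from lxml import etree

# Sample input; the <?keep me?> instruction is a hypothetical addition
XML = "<text><?oasys _dc21-?>Text <?keep me?><i>contents</i></text>"


def strip_all_pis(root):
    """Remove every processing instruction below root in one call."""
    # strip_tags() accepts the ProcessingInstruction factory in place of a
    # tag name; the tail text of each removed node is merged into its parent.
    etree.strip_tags(root, etree.ProcessingInstruction)


def strip_pis_by_target(root, target):
    """Remove only the processing instructions with the given target."""
    # Build the list first so removals do not disturb the traversal.
    for pi in list(root.iter(etree.ProcessingInstruction)):
        if pi.target != target:
            continue
        parent = pi.getparent()
        # remove() discards the tail along with the node, so move the
        # tail text onto the previous sibling or the parent beforehand.
        tail = pi.tail or ""
        previous = pi.getprevious()
        if previous is not None:
            previous.tail = (previous.tail or "") + tail
        else:
            parent.text = (parent.text or "") + tail
        parent.remove(pi)


all_gone = etree.fromstring(XML)
strip_all_pis(all_gone)
all_gone_xml = etree.tostring(all_gone, encoding="unicode")
print(all_gone_xml)

only_oasys = etree.fromstring(XML)
strip_pis_by_target(only_oasys, "oasys")
only_oasys_xml = etree.tostring(only_oasys, encoding="unicode")
print(only_oasys_xml)
```

Neither helper touches processing instructions that sit outside the root element (before or after it), since those are not descendants of the root.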