Removing Processing Instructions Using Python lxml

Question


I am using the Python lxml library to convert XML files to a new schema, but I have run into problems handling processing instructions in the XML body.

The processing instructions are scattered throughout the XML, as in the following example (they all start with "oasys" and end with a unique code):

string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"

I cannot find them with the element's findall() method, although getchildren() returns them:

 tree = lxml.etree.fromstring(string)
 print tree.findall(".//")
 >>>> [<Element i at 0x747c>]
 print tree.getchildren()
 >>>> [<?oasys _dc21-?>, <Element i at 0x747x>]
 print tree.getchildren()[0].tag
 >>>> <built-in function ProcessingInstruction>
 print tree.getchildren()[0].tail
 >>>> Text

Is there an alternative to getchildren() for finding and deleting processing instructions, especially considering that they are nested at different levels of the XML?


python xml lxml

meng_die Jul 20 '15 at 16:59


1 answer

mzjn · Accepted Answer · 2015-07-20T18:46:00+0000

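You can use the processing-instruction() XPath node test to find the processing instructions, and then remove them with etree.strip_tags(). Because strip_tags() removes only the node itself, the tail text that follows each processing instruction is kept.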

Example:

 from lxml import etree

 string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"
 tree = etree.fromstring(string)

 pis = tree.xpath("//processing-instruction()")
 for pi in pis:
     etree.strip_tags(pi.getparent(), pi.tag)

 print etree.tostring(tree)

Output:

 <text>Text <i>contents</i></text>
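
As a variation on the loop above, here is a minimal Python 3 sketch of two related approaches. It assumes that etree.strip_tags() accepts etree.ProcessingInstruction directly as a tag argument, so one call on the root covers every nesting level. The helper names strip_all_pis and strip_pis_by_target are made up for illustration and are not part of lxml. The second helper is for the case where only instructions with a particular target (such as "oasys") should go; it moves the tail text by hand because remove() discards a node together with its tail.

```python
from lxml import etree


def strip_all_pis(root):
    """Remove every processing instruction under root, keeping tail text."""
    # A single call walks the whole subtree, so no explicit loop is needed.
    etree.strip_tags(root, etree.ProcessingInstruction)
    return root


def strip_pis_by_target(root, target):
    """Remove only the PIs whose target equals `target`, keeping tail text."""
    # Collect the matches first so the tree is not modified while iterating.
    matches = [pi for pi in root.iter(etree.ProcessingInstruction)
               if pi.target == target]
    for pi in matches:
        parent = pi.getparent()
        tail = pi.tail or ""
        previous = pi.getprevious()
        # remove() drops the node together with its tail, so the tail is
        # appended to the preceding sibling (or the parent's text) first.
        if previous is not None:
            previous.tail = (previous.tail or "") + tail
        else:
            parent.text = (parent.text or "") + tail
        parent.remove(pi)
    return root


source = "<text><?oasys _dc21-?>Text <i>contents</i></text>"
cleaned = strip_all_pis(etree.fromstring(source))
print(etree.tostring(cleaned, encoding="unicode"))
# prints: <text>Text <i>contents</i></text>
```

Passing encoding="unicode" makes tostring() return a str rather than bytes under Python 3. If the document may contain other processing instructions that must survive the conversion, strip_pis_by_target(tree, "oasys") is the safer of the two.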
