HTML inside node using ElementTree

Question

I am using ElementTree to parse an XML file. Some fields contain HTML data. For example, consider the following:

<Course> <Description>Line 1<br />Line 2</Description> </Course>

Now, assume _course is the Element variable that holds this Course element. I want to access the course description, so I write:

 desc = _course.find("Description").text;

But then desc contains only "Line 1". I read something about the .tail attribute, so I also tried:

 desc = _course.find("Description").tail;

And I get the same result. What should I do to get "Line 1<br />Line 2" (or literally anything between <Description> and </Description>)? In other words, I'm looking for something similar to the .innerText property in C# (and, I suppose, many other languages).

+4

python html xml elementtree

Rafael almeida Jul 6 '09 at 18:17

4 answers

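You are trying to read the tail attribute from the wrong element. "Line 2" is the tail of the <br /> element, not of <Description>. Since find() only matches direct children, use a path to reach the <br />:

 desc = _course.find("Description/br").tail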

The tail attribute stores trailing text nodes when reading mixed-content XML files. Text that immediately follows an element's end tag is stored in that element's tail attribute:

  <tag> <elem> this goes into elem's
     text attribute </elem> this goes into
     elem's tail attribute </tag>
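
Applied to the XML from the question, the rule can be checked directly. This is a minimal sketch; the variable names are illustrative and not from the original post:

```python
import xml.etree.ElementTree as ET

# The document from the question.
course = ET.fromstring(
    "<Course><Description>Line 1<br />Line 2</Description></Course>"
)

desc = course.find("Description")
br = desc.find("br")

print(repr(desc.text))  # text before the first child of <Description>
print(repr(br.text))    # <br /> is empty, so it has no text
print(repr(br.tail))    # text between <br /> and the next tag
print(repr(desc.tail))  # nothing follows </Description> in this document
```

So "Line 1" lives on the Description element, while "Line 2" lives on the br element as its tail.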

Here is a simple snippet that prints the text and tail attributes of every element in an XML/XHTML document:

 import xml.etree.ElementTree as ET

 def processElem(elem):
     # Text between the start tag and the first child (or the end tag).
     if elem.text is not None:
         print(elem.text)
     for child in elem:
         processElem(child)
         # Text between the child's end tag and the next tag.
         if child.tail is not None:
             print(child.tail)

 xml = '''<Course>
     <Description>Line 1<br />Line 2 <span>child text </span>child tail</Description>
     </Course>'''

 root = ET.fromstring(xml)
 processElem(root)

Output (whitespace-only lines omitted):

 Line 1
 Line 2
 child text
 child tail

See http://code.activestate.com/recipes/498286-elementtree-text-helper/ for a better solution. It can be changed as required.
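
On Python 2.7 and 3.2 or later, the standard library's Element.itertext() method may be enough for an innerText-style result. The sketch below assumes that only the text is wanted and that the tags themselves can be dropped; the variable names are illustrative:

```python
import xml.etree.ElementTree as ET

course = ET.fromstring(
    "<Course><Description>Line 1<br />Line 2 "
    "<span>child text </span>child tail</Description></Course>"
)
desc = course.find("Description")

# itertext() yields the element's text, then each descendant's text and tail,
# in document order. The tags themselves are not included.
inner_text = "".join(desc.itertext())
print(inner_text)
```

Note that the <br /> contributes nothing, so "Line 1" and "Line 2" run together. Join with a separator instead of "" if that matters.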

PS I changed my name from user839338 as indicated in the next post

+3

Dan-dev Jul 11 '11 at 17:13

Characters such as "<" and "&" are illegal in XML elements.

"<" will generate an error because the parser interprets it as the beginning of a new element.

"&" will generate an error because the parser interprets it as the beginning of a character entity.

Some texts, such as JavaScript code, contain many "<" or "&" characters. To avoid script errors, the code can be defined as CDATA.

Everything inside a CDATA section is treated as plain character data and is not parsed as markup.

A CDATA section begins with "<![CDATA[" and ends with "]]>".
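
As a rough illustration of how this applies to the question, assume the producer of the file wraps the description in a CDATA section (this changes the input document, which the question does not do). ElementTree then returns the markup as ordinary text:

```python
import xml.etree.ElementTree as ET

# Assumed input: the same description as in the question, wrapped in CDATA.
xml = (
    "<Course><Description>"
    "<![CDATA[Line 1<br />Line 2]]>"
    "</Description></Course>"
)
course = ET.fromstring(xml)
desc = course.find("Description")

print(desc.text)  # the markup arrives as plain character data
print(len(desc))  # number of child elements created from the CDATA content
```

Because the <br /> is never parsed as an element, the whole description is available through .text alone.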

Additional information: http://www.w3schools.com/xmL/xml_cdata.asp

Hope this helps!

+1

ylebre Jul 6 '09 at 18:25

Inspired by user839338's answer, I went looking for a reasonable solution. It looks something like this:

 >>> from xml.etree import ElementTree as etree
 >>> corpus = '''<Course>
 ... <Description>Line 1<br />Line 2</Description>
 ... </Course>'''
 >>>
 >>> doc = etree.fromstring(corpus)
 >>> desc = doc.find("Description")
 >>> desc.tag = 'html'
 >>> etree.tostring(desc)
 '<html>Line 1<br/>Line 2</html>\n'
 >>>

There is no easy way to remove the surrounding tag (originally <Description> ), but it is easily modifiable into something that can be used as needed, such as <div> or <span>
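
One possible way to drop the surrounding tag is to serialize the element's text and each child separately. This is a sketch rather than part of the original answer: inner_markup is a hypothetical helper name, and encoding="unicode" assumes Python 3:

```python
import xml.etree.ElementTree as ET

def inner_markup(elem):
    """Return the serialized content of elem without its own start and end tags."""
    parts = [elem.text or ""]
    for child in elem:
        # tostring() serializes the child and, by default, its tail text too.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

course = ET.fromstring(
    "<Course><Description>Line 1<br />Line 2</Description></Course>"
)
print(inner_markup(course.find("Description")))
```

The result is re-serialized XML, so details such as the spacing inside empty tags may differ from the source file.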

+1

SingleNegationElimination Jul 15 '11 at 17:46

Dana the sane · Accepted Answer · 2009-07-06T18:22:37+0000

Do you have control over how the XML file is created? Element content that itself contains XML tags (or similar) or markup characters ('<', etc.) must be encoded to avoid this problem. You can do this with:

a CDATA section
Base64 or some other encoding (one that does not use reserved XML characters)
entity encoding ('<' becomes '&lt;')
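
The entity-encoding option can be sketched with ElementTree itself, which escapes reserved characters when it writes text. The example assumes you control the code that generates the file; the element names are taken from the question and the variable names are illustrative:

```python
import xml.etree.ElementTree as ET

# Producer side: store the HTML as text; ElementTree escapes it on output.
course = ET.Element("Course")
desc = ET.SubElement(course, "Description")
desc.text = "Line 1<br />Line 2"
serialized = ET.tostring(course, encoding="unicode")  # Python 3
print(serialized)

# Consumer side: the parser decodes the entities back into the original string.
parsed = ET.fromstring(serialized)
print(parsed.find("Description").text)
```

With the HTML stored this way, the reader gets the whole description from .text and no tail handling is needed.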

If you cannot make these changes, and ElementTree cannot be made to ignore tags that are not part of the XML schema, you will have to pre-process the file. Of course, you're out of luck if the schema overlaps with HTML.
