Nested Tags BeautifulSoup

I am trying to parse XML with BeautifulSoup, but I have hit a brick wall trying to use the recursive argument of findAll().

I have a rather strange xml format shown below:

    <?xml version="1.0"?>
    <catalog>
      <book>
        <author>Gambardella, Matthew</author>
        <title>XML Developer Guide</title>
        <genre>Computer</genre>
        <price>44.95</price>
        <publish_date>2000-10-01</publish_date>
        <description>An in-depth look at creating applications with XML.</description>
        <book>true</book>
      </book>
      <book>
        <author>Ralls, Kim</author>
        <title>Midnight Rain</title>
        <genre>Fantasy</genre>
        <price>5.95</price>
        <publish_date>2000-12-16</publish_date>
        <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
        <book>false</book>
      </book>
    </catalog>

As you can see, the book tag is repeated inside the book tag, which causes an error when I try to do something like:

    from BeautifulSoup import BeautifulStoneSoup as BSS

    catalog = "catalog.xml"


    def open_rss():
        f = open(catalog, 'r')
        return f.read()


    def rss_parser():
        rss_contents = open_rss()
        soup = BSS(rss_contents)
        items = soup.findAll('book', recursive=False)

        for item in items:
            print item.title.string

    rss_parser()

As you can see, I passed recursive=False to soup.findAll, which in theory should stop the search from descending into the elements it finds and move on to the next one instead.

This does not work as I always get the following error:

      File "catalog.py", line 17, in rss_parser
        print item.title.string
    AttributeError: 'NoneType' object has no attribute 'string'

I am sure that I am doing something stupid here, and I would be grateful if someone could help me in solving this problem.

Changing the XML structure is not an option. The code also needs to perform well, since it may have to parse a large XML file.

3 answers

soup.findAll('catalog', recursive=False) will return a list containing only the top-level catalog tag. Since it does not have a "title" child element, item.title is None.

Try soup.findAll("book") or soup.find("catalog").findChildren().

Edit: Well, the problem is not what I was thinking. Try the following:

    BSS.NESTABLE_TAGS["book"] = []
    soup = BSS(open("catalog.xml"))
    soup.catalog.findChildren(recursive=False)

The problem seems to be the nested book tags. BeautifulSoup has a predefined set of tags that can be nested (BeautifulSoup.NESTABLE_TAGS), but it does not know that book can be nested, so the parse goes wrong.

The section of the BeautifulSoup documentation on customizing the parser explains what is happening and how you can subclass BeautifulStoneSoup to customize which tags may be nested. Here's how we can use it to fix your problem:

    from BeautifulSoup import BeautifulStoneSoup

    class BookSoup(BeautifulStoneSoup):
        NESTABLE_TAGS = {
            'book': ['book']
        }

    soup = BookSoup(xml)  # xml string omitted to keep this short
    for book in soup.find('catalog').findAll('book', recursive=False):
        print book.title.string

If we run this, we get the following output:

    XML Developer Guide
    Midnight Rain
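
The direct-children-only search that recursive=False performs above can also be illustrated without BeautifulSoup. The following is only a sketch using the standard library's xml.etree.ElementTree (a swapped-in library, not what the answer uses), with a trimmed, hypothetical copy of the catalog as sample data. It assumes that a plain tag name passed to findall() matches direct children only, while iter() walks every descendant.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the catalog from the question (hypothetical sample data).
CATALOG_XML = """<?xml version="1.0"?>
<catalog>
  <book>
    <title>XML Developer Guide</title>
    <book>true</book>
  </book>
  <book>
    <title>Midnight Rain</title>
    <book>false</book>
  </book>
</catalog>"""

catalog = ET.fromstring(CATALOG_XML)

# A plain tag name in findall() matches direct children of <catalog> only.
top_level_books = catalog.findall("book")

# iter() walks all descendants, so it also yields the nested <book> flags.
all_books = list(catalog.iter("book"))

titles = [book.findtext("title") for book in top_level_books]
print(titles)
print(len(top_level_books), len(all_books))
```

Only the two outer book elements carry a title, so restricting the search to direct children avoids the None that the nested flag tags would otherwise produce.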

BeautifulSoup is slow and dead; use lxml instead :)

    >>> from lxml import etree
    >>> rss = open('/tmp/catalog.xml')
    >>> items = etree.parse(rss).xpath('//book/title/text()')
    >>> items
    ['XML Developer Guide', 'Midnight Rain']
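
Since the question mentions potentially large files, here is a hedged sketch of a streaming variant. It uses the standard library's xml.etree.ElementTree.iterparse rather than lxml, and the function name iter_top_level_titles and the in-memory sample are hypothetical. It tracks nesting depth so that only book elements directly under the root are reported, and it clears each one after use so its children do not stay in memory.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed stand-in for catalog.xml (hypothetical sample data).
SAMPLE = b"""<?xml version="1.0"?>
<catalog>
  <book>
    <title>XML Developer Guide</title>
    <book>true</book>
  </book>
  <book>
    <title>Midnight Rain</title>
    <book>false</book>
  </book>
</catalog>"""


def iter_top_level_titles(source):
    """Yield the <title> text of each <book> directly under the root."""
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        # On "end" the element and its children are fully parsed.
        if elem.tag == "book" and depth == 2:
            yield elem.findtext("title")
            elem.clear()  # drop the children we no longer need
        depth -= 1


titles = list(iter_top_level_titles(io.BytesIO(SAMPLE)))
print(titles)
```

For a real file, an open binary file object or a path could be passed in place of the BytesIO wrapper. Whether this is fast enough for a given catalog would need to be measured.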