I am trying to parse XML with BeautifulSoup, but I have hit a brick wall trying to use the recursive argument of findAll().
I have a rather strange XML format, shown below:
    <?xml version="1.0"?>
    <catalog>
      <book>
        <author>Gambardella, Matthew</author>
        <title>XML Developer Guide</title>
        <genre>Computer</genre>
        <price>44.95</price>
        <publish_date>2000-10-01</publish_date>
        <description>An in-depth look at creating applications with XML.</description>
        <book>true</book>
      </book>
      <book>
        <author>Ralls, Kim</author>
        <title>Midnight Rain</title>
        <genre>Fantasy</genre>
        <price>5.95</price>
        <publish_date>2000-12-16</publish_date>
        <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
        <book>false</book>
      </book>
    </catalog>
As you can see, the book tag is repeated inside the book tag, which causes an error when I try to do something like:
    from BeautifulSoup import BeautifulStoneSoup as BSS

    catalog = "catalog.xml"


    def open_rss():
        f = open(catalog, 'r')
        return f.read()


    def rss_parser():
        rss_contents = open_rss()
        soup = BSS(rss_contents)

        items = soup.findAll('book', recursive=False)
        for item in items:
            print item.title.string

    rss_parser()
As you can see, I passed recursive=False to soup.findAll, which in theory should stop it from descending into each element it finds and make it move on to the next one instead.
This does not work, as I always get the following error:
      File "catalog.py", line 17, in rss_parser
        print item.title.string
    AttributeError: 'NoneType' object has no attribute 'string'
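One plausible reading of this error is that at least one matched book element has no title child, for example the inner <book>true</book> flag. The sketch below illustrates that idea with the standard library's xml.etree.ElementTree rather than BeautifulSoup, so it is a comparison under stated assumptions, not a reproduction of the traceback. SAMPLE is a trimmed copy of the XML above, and title_of is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the catalog above (assumption: same nesting, fewer fields).
SAMPLE = """<catalog>
  <book>
    <title>XML Developer Guide</title>
    <book>true</book>
  </book>
  <book>
    <title>Midnight Rain</title>
    <book>false</book>
  </book>
</catalog>"""

root = ET.fromstring(SAMPLE)

# Recursive search: matches the outer records and the inner flag elements.
all_books = list(root.iter("book"))

# Direct children only: matches just the outer records.
top_level_books = root.findall("book")


def title_of(book):
    """Return the title text, or None when the element has no <title> child."""
    title = book.find("title")
    return title.text if title is not None else None


recursive_titles = [title_of(b) for b in all_books]
direct_titles = [title_of(b) for b in top_level_books]

print(recursive_titles)
print(direct_titles)
```

In this sketch the recursive search yields entries without a title, while the direct-children search does not. Checking for None before reading the text avoids the AttributeError in either case.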
I am sure that I am doing something stupid here, and I would be grateful if someone could help me solve this problem.
Changing the XML structure is not an option, and the code needs to perform well, since it may have to parse a large XML file.
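Because the file may be large, a streaming parse might be worth considering. The following is only a sketch, again using the standard library's xml.etree.ElementTree (its iterparse function) instead of BeautifulSoup. iter_titles is a hypothetical helper, SAMPLE is a trimmed stand-in for the real file, and the code assumes that the real records are direct children of the root element, as in the sample above.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed stand-in for catalog.xml (assumption: same nesting as the real file).
SAMPLE = b"""<?xml version="1.0"?>
<catalog>
  <book>
    <title>XML Developer Guide</title>
    <book>true</book>
  </book>
  <book>
    <title>Midnight Rain</title>
    <book>false</book>
  </book>
</catalog>"""


def iter_titles(source):
    """Yield the title of each top-level <book> while streaming the document.

    `source` may be a file name or a binary file object.
    """
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        # On an "end" event, depth still counts the element being closed:
        # the root is at depth 1, its direct children are at depth 2.
        if elem.tag == "book" and depth == 2:
            yield elem.findtext("title")
            # Drop the record's children and text so they can be freed;
            # the emptied element itself stays attached to the root.
            elem.clear()
        depth -= 1


titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)
```

Tracking the depth is what separates the outer book records from the inner book flags, since both share the same tag name.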
python xml nested beautifulsoup
Marcos placona