Parsing RSS with Elementtree in Python

Question


How do I find namespace-specific tags in XML using ElementTree in Python?

I have an XML / RSS document, for example:

    <?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0"
        xmlns:content="http://purl.org/rss/1.0/modules/content/"
        xmlns:wfw="http://wellformedweb.org/CommentAPI/"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:wp="http://wordpress.org/export/1.0/"
    >
    <channel>
        <title>sometitle</title>
        <pubDate>Tue, 28 Aug 2012 22:36:02 +0000</pubDate>
        <generator>http://wordpress.org/?v=2.5.1</generator>
        <language>en</language>
        <wp:wxr_version>1.0</wp:wxr_version>
        <wp:category><wp:category_nicename>apache</wp:category_nicename><wp:category_parent></wp:category_parent><wp:cat_name><![CDATA[Apache]]></wp:cat_name></wp:category>
    </channel>
    </rss>

But when I try to find all the "wp:category" tags by doing:

    import xml.etree.ElementTree as xml
    tree = xml.parse(fn)
    doc = tree.getroot()
    categories = doc.findall('channel/wp:category')

I get an error message:

 SyntaxError: prefix 'wp' not found in prefix map

Finding elements that are not in a namespace works fine. What am I doing wrong?

+6

python xml rss elementtree

Cerin Oct 12 '12 at 14:56


1 answer

Tom · Answer 1 · 2012-10-12T15:01:59+0000

You need to handle the namespace prefixes, either by using iterparse and handling the namespace events yourself, or by explicitly declaring the prefixes you are interested in before searching. Depending on what you are trying to do, I admit that in my lazier moments I just strip all the prefixes with a string replacement before parsing the XML.
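For the "declare the prefixes explicitly" option, here is a minimal sketch. It assumes a Python version whose `findall` accepts a prefix-to-URI mapping as its second argument. The `RSS` string is a cut-down copy of the feed from the question, and the names `RSS` and `NAMESPACES` are only illustrative.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the feed from the question (illustrative data only).
RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <title>sometitle</title>
    <wp:category>
      <wp:category_nicename>apache</wp:category_nicename>
      <wp:cat_name><![CDATA[Apache]]></wp:cat_name>
    </wp:category>
  </channel>
</rss>"""

# Prefix -> URI map. The URI has to match the xmlns:wp declaration in the
# document; the prefix itself is just a local shorthand used in the paths.
NAMESPACES = {"wp": "http://wordpress.org/export/1.0/"}

# Parse from bytes so the encoding declaration in the prolog is honoured.
root = ET.fromstring(RSS.encode("utf-8"))

# Pass the map to findall/findtext so the 'wp:' prefix can be resolved.
categories = root.findall("channel/wp:category", NAMESPACES)
names = [c.findtext("wp:cat_name", namespaces=NAMESPACES) for c in categories]

# The same lookup with the fully qualified {uri}tag form, no map required.
qualified = root.findall("channel/{http://wordpress.org/export/1.0/}category")

print(names)
```

The prefix used in the path does not have to be the one used in the file. Only the URI is compared, so `{"w": "http://wordpress.org/export/1.0/"}` with `channel/w:category` should find the same elements.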

EDIT: This similar question may help.
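For the iterparse option, the prefix map can be collected from the document itself instead of being typed in by hand. This is a rough sketch under that assumption: `parse_with_namespaces` is a hypothetical helper name, and the `RSS` bytes are a shortened stand-in for the real file.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the feed in the question (illustrative data only).
RSS = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <title>sometitle</title>
    <wp:category>
      <wp:category_nicename>apache</wp:category_nicename>
      <wp:cat_name><![CDATA[Apache]]></wp:cat_name>
    </wp:category>
  </channel>
</rss>"""


def parse_with_namespaces(source):
    """Parse `source` and return (root, {prefix: uri}).

    Hypothetical helper: the map is built from the 'start-ns' events that
    iterparse emits for every xmlns declaration it encounters.
    """
    namespaces = {}
    root = None
    for event, payload in ET.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = payload  # payload is a (prefix, uri) tuple
            namespaces[prefix] = uri
        elif root is None:
            root = payload  # the first 'start' event carries the root element
    return root, namespaces


# A filename or any binary file object works as the source.
root, ns = parse_with_namespaces(io.BytesIO(RSS))
categories = root.findall("channel/wp:category", ns)
print(ns["wp"], len(categories))
```

One caveat with this approach: if the same prefix is declared twice with different URIs, the later declaration overwrites the earlier one in the dictionary.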
