How do you extract feed URLs from an OPML file exported from Google Reader?

I have a piece of software called Rss-Aware that I am trying to use. It is basically a desktop feed checker that checks whether RSS feeds have been updated and sends a notification through Ubuntu's Notify-OSD system.

However, to tell it which feeds to check, you must list the feed URLs in the text file ~/.rss-aware/rssfeeds.txt, one per line. Like this:

http://example.com/feed.xml
http://othersite.org/feed.xml
http://othergreatsite.net/rss.xml

... seems pretty simple, right? Well, the list of feeds I would like to use is exported from Google Reader as an OPML file (a type of XML), and I don't know how to parse it so that it outputs just the feed URLs. It seems like it should be pretty straightforward, but I'm stumped.

I would love for someone to provide an implementation in Python or Ruby or something else that I could run quickly from the command line. A bash script would be awesome.

Thank you very much for your help. I am a very weak programmer and would like to learn how to do this kind of basic parsing.

EDIT: Also, here is the OPML file I'm trying to extract feed URLs from.

+5
4 answers

You can use listparser, a Python library that parses subscription lists such as OPML.

It works much like feedparser, for example:

>>> import listparser as lp
>>> d = lp.parse('https://dl.dropbox.com/u/670189/google-reader-subscriptions.xml')
>>> len(d.feeds)
112
>>> d.feeds[100].url
u'http://longreads.com/rss'
>>> d.feeds[100].tags
[u'reading']

To write the URLs to the file, a script like this should do:

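import listparser as lp

# Parse the OPML subscription list exported from Google Reader
d = lp.parse('https://dl.dropbox.com/u/670189/google-reader-subscriptions.xml')

# Write one feed URL per line to the file Rss-Aware reads
with open('/home/USERNAME/.rss-aware/rssfeeds.txt', 'w') as f:
    for i in d.feeds:
        f.write(i.url + '\n')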

Replace USERNAME with your own username.
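If you would rather not hard-code the home directory, something along these lines might work. This is only a sketch: the helper name write_feed_list is made up for illustration, and the default path is simply the one the question says Rss-Aware reads.

```python
import os


def write_feed_list(urls, path="~/.rss-aware/rssfeeds.txt"):
    """Write one feed URL per line to `path` (hypothetical helper)."""
    # expanduser replaces a leading "~" with the current user's home directory
    full_path = os.path.expanduser(path)
    # Create the containing directory if it does not exist yet
    directory = os.path.dirname(full_path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(full_path, "w") as f:
        for url in urls:
            f.write(url + "\n")
    return full_path
```

With the listparser result from the script above, it could be called as write_feed_list(i.url for i in d.feeds).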

+4

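Since it is XML, you can use XPath to extract the URLs. Looking at the XML, the feed URLs are stored in the xmlUrl attribute, so the XPath expression //@xmlUrl selects all of them.

Next, you need something that can evaluate XPath. There are several ways to use XPath from Python; lxml, for example, supports XPath.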
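As a rough sketch of the same idea using only the standard library: xml.etree.ElementTree supports just a subset of XPath and cannot return attribute values directly, so instead of //@xmlUrl this selects every element that has an xmlUrl attribute and then reads it. The function name feed_urls and the sample OPML are invented for illustration. (With lxml, tree.xpath('//@xmlUrl') should return the attribute values directly.)

```python
from xml.etree import ElementTree

# Made-up sample that resembles a Google Reader export
SAMPLE_OPML = """<opml version="1.0">
  <head><title>subscriptions</title></head>
  <body>
    <outline title="reading" text="reading">
      <outline text="Example" title="Example" type="rss"
               xmlUrl="http://example.com/feed.xml" htmlUrl="http://example.com/"/>
    </outline>
    <outline text="Other" title="Other" type="rss"
             xmlUrl="http://othersite.org/feed.xml" htmlUrl="http://othersite.org/"/>
  </body>
</opml>"""


def feed_urls(opml_text):
    """Return the xmlUrl values in document order (hypothetical helper)."""
    root = ElementTree.fromstring(opml_text)
    # ".//*[@xmlUrl]" matches any descendant element carrying an xmlUrl attribute
    return [node.get("xmlUrl") for node in root.iterfind(".//*[@xmlUrl]")]
```

Printing "\n".join(feed_urls(text)) for the real export would give one URL per line, which is the layout rssfeeds.txt expects.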

+2

Since it is XML, you can simply parse it:

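from xml.etree import ElementTree


def extract_rss_urls_from_opml(filename):
    """Return the xmlUrl of every outline element in the OPML file."""
    urls = []
    # Pass the filename so the parser handles the XML encoding itself
    tree = ElementTree.parse(filename)
    for node in tree.findall('.//outline'):
        url = node.attrib.get('xmlUrl')
        # Outlines without an xmlUrl (such as folders) are skipped
        if url:
            urls.append(url)
    return urls


urls = extract_rss_urls_from_opml('your_file')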
+2

Another option: a regular-expression search and replace can convert the Google Reader OPML into Firefox's HTML bookmarks format:

^\s+<outline.*?title="(.*?)".*?xmlUrl="(.*?)".*?htmlUrl="(.*?)".*?/>
<DT><A FEEDURL="$2" HREF="$3">$1</A>
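The answer does not say which tool runs this search and replace, so the following is only one possible sketch, in Python. It assumes each feed outline sits on a single indented line with its attributes in the order title, xmlUrl, htmlUrl, as the pattern requires. Python's re module writes backreferences as \1 rather than $1, and the name opml_to_bookmark_lines is made up for illustration.

```python
import re

# Same pattern as above; MULTILINE lets ^ match at the start of every line
OUTLINE_RE = re.compile(
    r'^\s+<outline.*?title="(.*?)".*?xmlUrl="(.*?)".*?htmlUrl="(.*?)".*?/>',
    re.MULTILINE,
)
# Python spelling of the replacement: \1 = title, \2 = xmlUrl, \3 = htmlUrl
BOOKMARK_TEMPLATE = r'<DT><A FEEDURL="\2" HREF="\3">\1</A>'


def opml_to_bookmark_lines(opml_text):
    """Rewrite matching outline lines as bookmark entries (hypothetical helper)."""
    return OUTLINE_RE.sub(BOOKMARK_TEMPLATE, opml_text)
```

Lines that do not match, such as folder outlines or the opml and body tags, are left untouched, so the result would still need cleaning up before Firefox can import it.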
0
