Python (newbie) Parse XML from API call

Question

I have been looking through tutorials, other Stack Overflow questions, and the documentation, and I still can't figure it out. Ugh!

I am making an API request and trying to parse the response (I would also like to assign the values to variables, but that is a bonus). Here is what I am trying to do. Why can't I get the title and link for the items?

    #!/usr/bin/python
    # Screen Scraper for Subs
    import urllib
    from xml.etree import ElementTree as ET

    show = 'heroes'
    season = '4'
    language = 'en'
    limit = '1'

    requestURL = 'http://api.allsubs.org/index.php?' \
        + 'search=' + show \
        + '+season+' + season \
        + '&language=' + language \
        + '&limit=' + limit

    root = ET.parse(urllib.urlopen(requestURL)).getroot()
    print root
    print '\n'

    items = root.findall('items')
    for item in items:
        item.find('title').text  # should print: <![CDATA[Heroes Season 4 Subtitles]]>
        item.find('link').text   # Should print: http://www.allsubs.org/subs-download/heroes+season+4/1223435/

XML response

    <AllSubsAPI>
      <title>AllSubs API: Subtitles Search</title>
      <link>http://www.allsubs.org</link>
      <description><![CDATA[Subtitles Search for Heroes Season 4]]></description>
      <language>en-us</language>
      <results>1</results>
      <found_results>24</found_results>
      <items>
        <item>
          <title><![CDATA[Heroes Season 4 Subtitles]]></title>
          <link>http://www.allsubs.org/subs-download/heroes+season+4/1223435/</link>
          <filename>heroes-season-4-english-heroes-season-4-en.zip</filename>
          <files_in_archive>Heroes - 4x01-02 - Orientation.HDTV.FQM.en.srt|Heroes - 4x17 - The Art of Deception.HDTV.2HD.en.srt|Heroes - 4x07 - Strange Attractors.HDTV.LOL.en.srt|Heroes - 4x08 - Once Upon a Time in Texas.HDTV.2HD.en.srt|Heroes - 4x07 - Strange Attractors.720p HDTV.DIMENSION.en.srt|Heroes - 4x05 - Hysterical Blindness.720p HDTV.X264.en.srt|Heroes - 4x09 - Shadowboxing.HDTV.LOL.en.srt|Heroes - 4x16 - Pass Fail.HDTV.LOL.en.srt|Heroes - 4x04 - Acceptance.HDTV.en.srt|Heroes - 4x01-02 - Orientation.720p HDTV.DIMENSION.en.srt|Heroes - 4x06 - Tabula Rasa.HDTV.NoTV.en.srt|Heroes - 4x10 - Brother Keeper.HDTV.FQM.en.srt|Heroes - 4x04 - Acceptance.HDTV.FQM.en.srt|Heroes - 4x14 - Let It Bleed.720p HDTV.DIMENSION.en.srt|Heroes - 4x06 - Tabula Rasa.720p HDTV.SiTV.en.srt|Heroes - 4x08 - Once Upon a Time in Texas.HDTV.NoTV.en.srt|Heroes - 4x12 - The Fifth Stage.HDTV.LOL.en.srt|Heroes - 4x19 - Brave New World.HDTV.LOL.en.srt|Heroes - 4x15 - Close to You.720p HDTV.DIMENSION.en.srt|Heroes - 4x03 - Ink.720p HDTV.DIMENSION.en.srt|Heroes - 4x11 - Thanksgiving.720p HDTV.DIMENSION.en.srt|Heroes - 4x13 - Upon This Rock.720p HDTV.DIMENSION.en.srt|Heroes - 4x13 - Upon This Rock.HDTV.LOL.en.srt|Heroes - 4x14 - Let It Bleed.HDTV.LOL.en.srt|Heroes - 4x15 - Close to You.HDTV.LOL.en.srt|Heroes - 4x12 - The Fifth Stage.720p HDTV.DIMENSION.en.srt|Heroes - 4x18 - The Wall.HDTV.LOL.en.srt|Heroes - 4x08 - Once Upon a Time in Texas.720p HDTV.CTU.en.srt|Heroes - 4x17 - The Art of Deception.HDTV.CTU.en.srt|Heroes - 4x09 - Shadowboxing.720p HDTV.DIMENSION.en.srt|Heroes - 4x10 - Brother Keeper.720p HDTV.DIMENSION.en.srt|Heroes - 4x04 - Acceptance.720p HDTV.CTU.en.srt|Heroes - 4x11 - Thanksgiving.HDTV.FQM.en.srt|Heroes - 4x03 - Ink.HDTV.FQM.en.srt|Heroes - 4x05 - Hysterical Blindness.HDTV.XII.en.srt|</files_in_archive>
          <languages>en</languages>
          <added_on>2010-02-16</added_on>
        </item>
      </items>
    </AllSubsAPI>

UPDATE:

It worked. Thanks for the help and for pointing out my typo:

    items = root.findall('items/item')
    for item in items:
        print item.find('title').text
        print item.find('link').text
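
For the "bonus" part (assigning the values to variables), here is a minimal sketch. It is written for Python 3 rather than the Python 2 used in the question, and it parses a trimmed copy of the response from a string instead of calling the API. The helper name `extract_subs` and the dict layout are illustrative assumptions, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response shown above (most fields omitted).
SAMPLE = """<AllSubsAPI>
  <title>AllSubs API: Subtitles Search</title>
  <items>
    <item>
      <title><![CDATA[Heroes Season 4 Subtitles]]></title>
      <link>http://www.allsubs.org/subs-download/heroes+season+4/1223435/</link>
    </item>
  </items>
</AllSubsAPI>"""


def extract_subs(xml_text):
    """Return a list of {'title': ..., 'link': ...} dicts, one per <item>."""
    root = ET.fromstring(xml_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.findall("items/item")
    ]


subs = extract_subs(SAMPLE)
# Assign the first result to plain variables.
title, link = subs[0]["title"], subs[0]["link"]
print(title)
print(link)
```

Note that the parser hands back the text inside the CDATA section, so the title comes out without the `<![CDATA[ ... ]]>` wrapper.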

+6

python xml api parsing

Phill pafford Dec 15 '10 at 15:26

3 answers

This works for me. Note that I am using urllib2 for proxy support:

    import urllib2
    from xml.etree import ElementTree as ET

    show = 'heroes'
    season = '4'
    language = 'en'
    limit = '1'

    requestURL = 'http://api.allsubs.org/index.php?' \
        + 'search=' + show \
        + '+season+' + season \
        + '&language=' + language \
        + '&limit=' + limit

    root = ET.parse(urllib2.urlopen(requestURL)).getroot()
    print root
    print '\n'

    items = root.findall('items')[0].findall('item')
    for item in items:
        print item.find('title').text  # should print: <![CDATA[Heroes Season 4 Subtitles]]>
        print item.find('link').text   # Should print: http://www.allsubs.org/subs-download/heroes+season+4/1223435/

Note that findall('items') finds the "items" tag. What you want to loop over (I think) is the "item" tags inside it, so we call findall() on those as well. Also, you need print to get any output from Python.
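
A small Python 3 sketch of that point, using a made-up two-item document rather than the real API response: findall('items') matches only the wrapper element, and that wrapper has no direct title child.

```python
import xml.etree.ElementTree as ET

# Made-up document with the same nesting as the API response.
root = ET.fromstring(
    "<AllSubsAPI><title>feed</title>"
    "<items><item><title>first</title></item>"
    "<item><title>second</title></item></items>"
    "</AllSubsAPI>"
)

containers = root.findall("items")  # the single <items> wrapper
print(len(containers))

# <items> has no direct <title> child, so find() returns None here,
# and calling .text on that None would raise AttributeError.
missing = containers[0].find("title")
print(missing)

entries = containers[0].findall("item")  # the <item> children
titles = [entry.find("title").text for entry in entries]
print(titles)
```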

Also, if I do this with limit = 2, I get an error:

    Traceback (most recent call last):
      File "heros.py", line 18, in <module>
        root = ET.parse(urllib2.urlopen(requestURL)).getroot()
      File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 862, in parse
        tree.parse(source, parser)
      File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 586, in parse
        parser.feed(data)
      File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 1245, in feed
        self._parser.Parse(data, 0)
    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 24, column 95

I'm not sure that the XML returned by this API is well-formed. For starters, there is no XML declaration at the beginning (although the declaration is optional, so its absence alone does not cause this error). I would not trust it.
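
If the feed cannot be trusted, one option is to catch the parse failure instead of letting the script crash. The sketch below is Python 3, where the expat error surfaces as `ET.ParseError`. The helper name `parse_or_none` is hypothetical, and the bare `&` is only an assumed example of an invalid token; what the API actually returned at line 24, column 95 is not known.

```python
import xml.etree.ElementTree as ET


def parse_or_none(xml_text):
    """Parse xml_text; return the root Element, or None if it is not well-formed."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple pointing at the offending token.
        print("could not parse response:", err, "at", err.position)
        return None


# Hypothetical inputs: a bare '&' is not allowed in XML text, '&amp;' is.
bad = "<items><item><title>Tom & Jerry</title></item></items>"
good = "<items><item><title>Tom &amp; Jerry</title></item></items>"

print(parse_or_none(bad))
print(parse_or_none(good).find("item/title").text)
```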

+3

Spacedman Dec 15 '10 at 15:43

You are not iterating over the item elements; you are actually iterating over the items element.

I think it should be:

    items = root.find('items')  # find() returns the <items> element itself; findall() would return a list
    childItems = items.findall('item')
    for childItem in childItems:
        print childItem.find('title').text  # should print: Heroes Season 4 Subtitles
        print childItem.find('link').text   # should print: http://www.allsubs.org/subs-download/heroes+season+4/1223435/
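
One detail worth checking when chaining these calls: findall returns a plain list, which has no findall method of its own, while find returns a single element (or None). A minimal Python 3 sketch on a made-up document:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><items><item><title>a</title></item></items></root>")

as_list = root.findall("items")  # list of matching elements
as_element = root.find("items")  # first matching element, or None

print(type(as_list).__name__)
print(hasattr(as_list, "findall"))  # a list cannot be searched further
child_titles = [child.find("title").text for child in as_element.findall("item")]
print(child_titles)
```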

+2

Mikeyg36 Dec 15 '10 at 15:34

Adam vandenberg · Accepted Answer · 2010-12-15T15:28:33+0000

 items = root.findall('items')

it should be

 items = root.findall('items/item')
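
The argument to findall is a path relative to the element it is called on, which is why 'items/item' reaches the nested elements in one step. A short Python 3 sketch of a few equivalent spellings, on a made-up document with placeholder links:

```python
import xml.etree.ElementTree as ET

# Made-up document; the example.invalid links are placeholders.
root = ET.fromstring(
    "<AllSubsAPI><title>feed</title><items>"
    "<item><title>one</title><link>http://example.invalid/1</link></item>"
    "<item><title>two</title><link>http://example.invalid/2</link></item>"
    "</items></AllSubsAPI>"
)

by_path = root.findall("items/item")  # <item> children of <items> under the root
anywhere = root.findall(".//item")    # <item> at any depth below the root
first_title = root.findtext("items/item/title")  # text of the first match only
all_titles = [t.text for t in root.iterfind("items/item/title")]

print(len(by_path), len(anywhere))
print(first_title)
print(all_titles)
```

The './/item' form is convenient but also matches item elements nested elsewhere in the document, so the explicit 'items/item' path is the safer choice when the layout is known.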
