How to iterate over an HTML dataset in Python

This is the first time I'm trying to pick up some Python skills; please be kind to me :-)

While I am not completely unfamiliar with programming concepts (I used to work with PHP), switching to Python has turned out to be somewhat difficult for me. I suppose this is mainly because I lack most, if not all, of the basic understanding of common "design patterns" (?) and so on.

Having said that, here is the problem. Part of my current project involves writing a simple scraper using Beautiful Soup. The data to be processed has a structure similar to the one shown below.

<table>
    <tr>
        <td class="date">2011-01-01</td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr>
        <td class="date">2011-01-02</td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
</table>

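The main problem is that I can't work out how to 1) keep track of the current date (tr -> td class="date") while 2) looping over the items in the subsequent tr:s (tr class="item" -> td class="headline" and tr class="item" -> td class="link") and 3) store the processed data.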



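The approach is to loop over every table row and:

  • if the row contains a tabledata cell of class 'date', store its text in last_seen_date
  • otherwise, read the headline and link and append a tuple (last_seen_date, headline, link) to the list of items

For example: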

from bs4 import BeautifulSoup   # Beautiful Soup 4 (pip install beautifulsoup4)

fname = r'c:\mydir\beautifulSoup.html'
with open(fname, 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

items = []
last_seen_date = None
for el in soup.find_all('tr'):
    daterow = el.find('td', {'class': 'date'})
    if daterow is None:     # not a date - get headline and link
        headline = el.find('td', {'class': 'headline'}).text
        link = el.find('a').get('href')
        items.append((last_seen_date, headline, link))
    else:                   # get new date
        last_seen_date = daterow.text
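
Once the loop has filled items, the tuples can be regrouped per date if that is more convenient for later processing. This is only a rough sketch: the sample tuples are made up to mirror the table in the question, and group_by_date is a hypothetical helper name, not part of Beautiful Soup.

```python
from collections import defaultdict

# Made-up sample in the same (date, headline, link) shape that the
# loop above appends to `items`; the values are illustrative only.
items = [
    ('2011-01-01', 'Headline A', '#a'),
    ('2011-01-01', 'Headline B', '#b'),
    ('2011-01-02', 'Headline C', '#c'),
]


def group_by_date(rows):
    """Collect (date, headline, link) tuples into a dict keyed by date."""
    grouped = defaultdict(list)
    for date, headline, link in rows:
        grouped[date].append({'headline': headline, 'link': link})
    return dict(grouped)


by_date = group_by_date(items)
for date, entries in by_date.items():
    print(date, len(entries))
```

Keeping the flat list of tuples is equally fine; the dict is just handy when each date should be processed as one unit.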

You can use ElementTree, which is included with Python.

http://docs.python.org/library/xml.etree.elementtree.html

from xml.etree.ElementTree import ElementTree

tree = ElementTree()
tree.parse('page.xhtml') #This is the XHTML provided in the OP
root = tree.getroot() #Returns the heading "table" element
print(root.tag) #"table"
for eachTableRow in root:
    #Iterating over root yields its direct children, the <tr> elements
    #(getchildren() was removed in Python 3.9)
    #So we're going to loop over them and check their attributes
    if 'class' in eachTableRow.attrib:
        #Good to go. Now we know to look for the headline and link
        pass
    else:
        #Okay, so look for the date
        pass
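
To show how the two pass branches might be filled in, here is a minimal sketch. It assumes the markup is well-formed XHTML, as in the question's sample, which is shortened here to one item per date. The name extract_items is hypothetical and only used for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup taken from the question, shortened to one item per date.
SAMPLE = """
<table>
    <tr><td class="date">2011-01-01</td></tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr><td class="date">2011-01-02</td></tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
</table>
"""


def extract_items(xhtml):
    """Return (date, headline, href) tuples from the table markup."""
    root = ET.fromstring(xhtml)      # the <table> element
    items = []
    current_date = None
    for row in root:                 # direct children, i.e. the <tr> rows
        if row.get('class') == 'item':
            # Item row: read the headline text and the link target
            headline = row.find("td[@class='headline']").text
            href = row.find("td[@class='link']/a").get('href')
            items.append((current_date, headline, href))
        else:
            # Any other row: remember the date if it has a date cell
            date_cell = row.find("td[@class='date']")
            if date_cell is not None:
                current_date = date_cell.text
    return items


items = extract_items(SAMPLE)
```

Note that ElementTree is an XML parser, so it will raise an error on HTML that is not well-formed; for real-world pages a lenient parser such as Beautiful Soup is usually the safer choice.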

