How to extract specific parts of a web page in Python

Question


Destination webpage: http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm

The section I want to extract:

<tr> <td>Skilled &ndash; Independent (Residence) subclass 885<br />online</td> <td>N/A</td> <td>N/A</td> <td>N/A</td> <td>15 May 2011</td> <td>N/A</td> </tr>

Once the code has found this row by searching for the keyword "subclass 885 online", it should print the date in the 5th <td> tag, which is "15 May 2011" as shown above.

This is just a way for me to monitor the progress of my immigration application.

+7

python string html

jiaoziren Aug 14 '11 at 4:42


3 answers

You might want to use this as a starting point:

    Python 2.6.7 (r267:88850, Jun 13 2011, 22:03:32)
    [GCC 4.6.1 20110608 (prerelease)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import urllib2, re
    >>> from BeautifulSoup import BeautifulSoup
    >>> urllib2.urlopen('http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm')
    <addinfourl at 139158380 whose fp = <socket._fileobject object at 0x84aa2ac>>
    >>> html = _.read()
    >>> soup = BeautifulSoup(html)
    >>> soup.find(text = re.compile('\\bsubclass 885\\b')).parent.parent.find('td', text = re.compile(' [0-9]{4}$'))
    u'15 May 2011'
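To see why those two regular expressions pick out the right pieces, here is a small sketch that runs without network access. It assumes Python 3 and uses only the standard library. The text nodes are typed in by hand from the row quoted in the question and are not fetched from the live page. The variable names are made up for illustration.

```python
import re

label_re = re.compile(r"\bsubclass 885\b")  # the visa label as a whole phrase
date_re = re.compile(r" [0-9]{4}$")         # a space plus a four-digit year at the end

# Text nodes of the row from the question, typed in by hand for illustration.
# The <br /> splits the first cell into two separate text nodes.
first_cell_nodes = ["Skilled \u2013 Independent (Residence) subclass 885", "online"]
other_cells = ["N/A", "N/A", "N/A", "15 May 2011", "N/A"]

# The label pattern matches the node before the <br />, where "885" ends the string.
label_found = any(label_re.search(node) for node in first_cell_nodes)

# It would not match the joined text, because "885online" has no word boundary.
joined_matches = label_re.search("".join(first_cell_nodes)) is not None

# The date pattern picks out the only cell that ends in a four-digit year.
date_cells = [cell for cell in other_cells if date_re.search(cell)]

print(label_found, joined_matches, date_cells)
```

The point to notice is that the trailing `\b` only works because the search runs over individual text nodes. If the cell text were concatenated first, the pattern would need to drop that trailing boundary.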

+6

jcomeau_ictx Aug 14 '11 at 5:00


There is a library called Beautiful Soup that does what you are asking for: http://www.crummy.com/software/BeautifulSoup/

+2

Anil Shanbhag Aug 14 '11 at 4:45


Johnsyweb · Accepted Answer · 2011-08-14T06:16:51+0000

"Beau--ootiful Soo--oop!
Beau--ootiful Soo--oop!
Soo--oop of the e--e--evening,
Beautiful, beauti--FUL SOUP!"

- Lewis Carroll, Alice's Adventures in Wonderland

I think this is exactly what he had in mind!

The Mock Turtle would probably do something like this:

    >>> from BeautifulSoup import BeautifulSoup
    >>> import urllib2
    >>> url = 'http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm'
    >>> page = urllib2.urlopen(url)
    >>> soup = BeautifulSoup(page)
    >>> for row in soup.html.body.findAll('tr'):
    ...     data = row.findAll('td')
    ...     if data and 'subclass 885online' in data[0].text:
    ...         print data[4].text
    ...
    15 May 2011
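The session above is Python 2 with `urllib2` and the old `BeautifulSoup` import. For anyone on Python 3 who wants to try the same row-scanning idea without installing anything, here is a rough sketch. It swaps in the standard library's `html.parser` instead of Beautiful Soup, so it is a different technique from the one in this answer. `RowCollector`, `allocation_date` and `SAMPLE_ROW` are names invented for this example, and the HTML is the row from the question wrapped in a `<table>`, not the live page.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by the <tr> it sits in."""

    def __init__(self):
        super().__init__()  # character references such as &ndash; are decoded by default
        self.rows = []
        self._row = None    # cells of the <tr> currently being read
        self._cell = None   # text fragments of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


# The row from the question, used as stand-in input (an assumption:
# the real page may look different today).
SAMPLE_ROW = (
    "<table><tr> <td>Skilled &ndash; Independent (Residence) subclass 885"
    "<br />online</td> <td>N/A</td> <td>N/A</td> <td>N/A</td> "
    "<td>15 May 2011</td> <td>N/A</td> </tr></table>"
)


def allocation_date(html, keyword="subclass 885"):
    """Return the 5th cell of the first row whose 1st cell contains keyword."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    for cells in parser.rows:
        if len(cells) >= 5 and keyword in cells[0]:
            return cells[4]
    return None


result = allocation_date(SAMPLE_ROW)
print(result)
```

To run it against a real page you would pass the downloaded HTML text to `allocation_date` in place of `SAMPLE_ROW`. Nested tables or missing closing tags would need more handling than this sketch has.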

But I'm not sure that this will help, as this date has already passed!

Good luck with your application!
