I am trying to get the table at http://www.datamystic.com/timezone/time_zones.html into array format so that I can process it however I like, preferably in PHP, Python, or JavaScript.
This is a problem that arises a lot, so instead of looking for help on this particular problem, I am looking for ideas on how to solve all such problems.
BeautifulSoup is the first thing that comes to mind. Another possibility is copying / pasting into TextMate, and then running regular expressions.
What do you suggest?
This is a script that I wrote, but as I said, I'm looking for a more general solution.
    from BeautifulSoup import BeautifulSoup
    import urllib2

    url = 'http://www.datamystic.com/timezone/time_zones.html'
    response = urllib2.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html)

    # The second table on the page holds the time zone data.
    tables = soup.findAll("table")
    table = tables[1]
    rows = table.findAll("tr")
    for row in rows:
        tds = row.findAll('td')
        # Data rows have exactly four cells; skip headers and spacers.
        if len(tds) == 4:
            countrycode = tds[1].string
            timezone = tds[2].string
            # .string is None when a cell contains nested markup.
            if countrycode is not None and timezone is not None:
                print "'%s' => '%s'," % (countrycode.strip(), timezone.strip())
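For comparison, a more general version of the same idea could be sketched roughly as follows. This is only a sketch under some assumptions: it targets Python 3 and uses just the standard library's `html.parser`, it expects reasonably well-formed markup with explicit closing tags, and it does not handle nested tables properly. The names `TableExtractor` and `html_tables` are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in a document as a list of rows of cell text.

    Hypothetical helper: assumes explicit closing tags and no nested tables.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def html_tables(html):
    """Return all tables in an HTML string as nested lists of strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


if __name__ == "__main__":
    sample = (
        "<table>"
        "<tr><th>Code</th><th>Zone</th></tr>"
        "<tr><td> US </td><td>America/New_York</td></tr>"
        "</table>"
    )
    for row in html_tables(sample)[0]:
        print(row)
```

The appeal of this shape is that page-specific choices (which table, which columns, how many cells a data row has) stay outside the parser, so the same function can be reused on other pages by indexing into the returned lists.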
Comments and suggestions for improving my Python are welcome too ;)
python regex html-parsing beautifulsoup
Zack burt