The fastest, easiest, and best way to parse an HTML table?

Question

I am trying to get the table at http://www.datamystic.com/timezone/time_zones.html into array format so that I can do whatever I want with it, preferably in PHP, Python, or JavaScript.

This is a problem that arises a lot, so instead of looking for help on this particular problem, I am looking for ideas on how to solve all such problems.

BeautifulSoup is the first thing that comes to mind. Another possibility is copying / pasting into TextMate, and then running regular expressions.

What do you suggest?

This is a script that I wrote, but as I said, I'm looking for a more general solution.

    from BeautifulSoup import BeautifulSoup
    import urllib2

    url = 'http://www.datamystic.com/timezone/time_zones.html'
    response = urllib2.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html)
    tables = soup.findAll("table")
    table = tables[1]
    rows = table.findAll("tr")
    for row in rows:
        tds = row.findAll('td')
        if(len(tds)==4):
            countrycode = tds[1].string
            timezone = tds[2].string
            if(type(countrycode) is not type(None) and type(timezone) is not type(None)):
                print "\'%s\' => \'%s\'," % (countrycode.strip(), timezone.strip())

Comments and suggestions for improving my Python are welcome too ;)

+9

python regex html-parsing beautifulsoup

Zack burt Feb 04 '11 at 0:19

6 answers

Avoid regular expressions for HTML parsing. They just don't work for it; you need a DOM parser like BeautifulSoup.

Several other alternatives:

SimpleHTMLDom (PHP)
Hpricot and Nokogiri (Ruby)
Web::Scraper (Perl / CPAN)

All of them are quite tolerant of poorly formed HTML.
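As a rough illustration of parser-based (rather than regex-based) table extraction, here is a minimal sketch. It swaps in Python's standard-library html.parser for BeautifulSoup so that it has no third-party dependency. The class name TableExtractor and the sample markup are made up for this example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []      # one list of rows per <table> seen so far
        self._row = None      # cells of the row being read, or None
        self._cell = None     # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._close_row()           # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()          # tolerate a missing </td> or </th>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None and self.tables:
            self.tables[-1].append(self._row)
        self._row = None


# Hypothetical sample with some unclosed tags, as often found in real pages.
sample = """
<table>
  <tr><th>Code<th>Zone</tr>
  <tr><td>DE<td>Europe/Berlin
  <tr><td>JP</td><td>Asia/Tokyo</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(sample)
parser.close()
print(parser.tables[0])
```

The same class can be fed the text of any page, and the table of interest is then picked by index, much like tables[1] in the question's script.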

+4

ocodo Feb 04 '11 at 0:23

I suggest loading the document with an XML parser such as DOMDocument::loadHTMLFile, which ships with PHP, and then using XPath to grep the data you need.

This is not the fastest way, but in my opinion it gives the most readable result. You could use regular expressions, which would probably be a little faster, but that is bad style (hard to debug, hard to read).
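To show what "load into a parser, then select with XPath" looks like in practice, here is a small Python sketch of the same idea. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and assumes well-formed markup. The inline document and its column layout are hypothetical stand-ins, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a page that contains two tables.
doc = """
<html><body>
  <table><tr><td>layout only</td></tr></table>
  <table>
    <tr><td>1</td><td>DE</td><td>Europe/Berlin</td><td>+01:00</td></tr>
    <tr><td>2</td><td>JP</td><td>Asia/Tokyo</td><td>+09:00</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(doc.strip())

# ".//table[2]/tr": rows of any table that is the second <table> under its parent.
rows = root.findall(".//table[2]/tr")

# XPath positions are 1-based, so td[2] and td[3] are the second and third cells.
pairs = [(row.findtext("td[2]"), row.findtext("td[3]")) for row in rows]
print(pairs)
```

The selection logic lives in two short path expressions, which is what makes this approach easier to read and debug than a regular expression.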

EDIT: This is actually more complicated, because the page you specified is not valid HTML (see validator.w3.org). Tags that lack a matching opening or closing tag are especially troublesome.

It seems that xmlstarlet (http://xmlstar.sourceforge.net/, a great tool) can fix the problem (run xmlstarlet fo -R). xmlstarlet can also run XPath queries and XSLT scripts, which can help you extract your data with a simple shell script.

0

yankee Feb 04 '11 at 0:25

There are many options.

I put together a benchmark while building serpapi.com: https://medium.com/@vikoky/fastest-html-parser-available-now-f677a68b81dd

0

victor benarbia Dec 7 '18 at 23:04

While we were building SerpAPI, we tested many platforms and parsers.

Here is the test result for Python.

For more information, here is the full article on Medium: https://medium.com/@vikoky/fastest-html-parser-available-now-f677a68b81dd

0

jvmvik Dec 9 '18 at 23:10

Regular expressions perform better than a DOM parser.

Take a look at this comparison:

http://www.rockto.com/launcher/28852/mochien.com/Blog/Read/A300111001736/Regex-VS-DOM-untuk-Rockto-Team

You can find many more such comparisons by searching the Internet.

-2

Gustavo costa de oliveira Feb 04 '11 at 0:32

Steven · Accepted Answer · 2011-02-04T10:33:58+0000

For your general problem: try lxml.html from the lxml package (think of it as the stdlib's xml.etree on steroids: the same XML API, but with support for HTML, XPath, XSLT, etc.).

A quick example for your specific case:

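    from lxml import html

    # Fetch and parse the page.
    tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')

    # The data is in the second table of the document.
    table = tree.findall('.//table')[1]

    # One sublist per row, one stripped string per cell.
    data = [
        [td.text_content().strip() for td in row.findall('td')]
        for row in table.findall('tr')
    ]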

This will give you a nested list: each sublist corresponds to a row in the table and contains the data from its cells. The sneakily inserted advertisement rows are not filtered out yet, but this should get you on your way. (And by the way: lxml is fast!)
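One possible way to drop the non-data rows from such a nested list is sketched below. It borrows the question's own heuristic (keep only rows with exactly four cells, then take the second and third cells). The helper name rows_to_mapping and the sample rows are invented for illustration, and the assumption that advertisement rows have a different cell count has not been checked against the real page.

```python
def rows_to_mapping(rows, width=4, key_col=1, value_col=2):
    """Keep only rows with `width` cells and map one column to another."""
    mapping = {}
    for row in rows:
        if len(row) != width:
            continue                      # assumed to be an ad or other non-data row
        key, value = row[key_col], row[value_col]
        if key and value:                 # skip rows whose cells are empty
            mapping[key] = value
    return mapping


# Hypothetical rows in the shape produced by the nested list above.
data = [
    ["1", "DE", "Europe/Berlin", "+01:00"],
    ["Advertisement"],
    ["2", "JP", "Asia/Tokyo", "+09:00"],
    ["3", "", "", ""],
]

print(rows_to_mapping(data))
```

Keeping the filter separate from the parsing step means the same extraction code can be reused for other tables with a different width or different columns.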

BUT, for your specific use case: there are better ways to get time zone database information than scraping this particular web page (also, note that the page itself states that you are not allowed to copy its content). There are even libraries that already use this information; see, for example, python-dateutil.
