What is the easiest way to programmatically extract structured data from a heap of web pages?

I am currently using an Adobe AIR application that I wrote to follow the links on one page and grab a section of data from each of the subsequent pages. This works well, and for programmers I think this approach (in AIR or another language) is reasonable, though it has to be written on a case-by-case basis. If there is a particular language or library that lets a programmer do this very quickly, I would be interested to know what it is.

Also, are there any tools that would let a non-programmer, such as a customer service representative or someone responsible for collecting data, extract structured data from web pages without a lot of copy and paste?

+7
java c# flex perl air
6 answers

If you search Stack Overflow for WWW::Mechanize and pQuery you will see many examples of using these Perl CPAN modules.
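
For the "follow the links on one page, then grab a section from each linked page" pattern described in the question, a minimal WWW::Mechanize sketch might look something like this (the start URL, the link pattern and the div class are made up for illustration; pQuery or HTML::TreeBuilder would be sturdier than a regex for the extraction step):

    use strict;
    use warnings;
    use WWW::Mechanize;

    # Start page and link pattern are hypothetical -- adjust for the real site.
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get('http://example.com/listing.html');

    # Collect the links we care about from the index page.
    my @links = $mech->find_all_links( url_regex => qr/detail/ );

    for my $link (@links) {
        $mech->get( $link->url_abs );

        # Grab one section from each detail page (a naive regex for brevity).
        if ( $mech->content =~ m{<div class="price">\s*([^<]+)</div>} ) {
            print $link->url_abs, "\t$1\n";
        }
    }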

However, since you mentioned "non-programmer", perhaps the Web::Scraper CPAN module might be more appropriate? It is more of a DSL, and possibly easier for a "non-programmer" to pick up.

Here is an example from the documentation for extracting tweets from Twitter:

    use URI;
    use Web::Scraper;

    my $tweets = scraper {
        process "li.status", "tweets[]" => scraper {
            process ".entry-content",        body => 'TEXT';
            process ".entry-date",           when => 'TEXT';
            process 'a[rel="bookmark"]',     link => '@href';
        };
    };

    my $res = $tweets->scrape( URI->new("http://twitter.com/miyagawa") );

    for my $tweet (@{$res->{tweets}}) {
        print "$tweet->{body} $tweet->{when} (link: $tweet->{link})\n";
    }
+11

I have found YQL very powerful and useful for this kind of thing. You can select any web page from the Internet, it will tidy the page into valid markup, and then it lets you use XPath to query sections of it. You can output the result as XML or JSON for loading into another script or application.
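
To give a feel for what such a query looks like, here is a minimal sketch in Perl (matching the other answers) that sends a statement against YQL's built-in html table to the public REST endpoint Yahoo exposed at the time and prints the JSON it returns; the target URL and the XPath expression are placeholders:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI;

    # YQL statement against the built-in "html" table: fetch a page and
    # return only the nodes matching an XPath expression.
    my $yql = q{select * from html where url="http://example.com/" and xpath='//h1'};

    my $endpoint = URI->new('http://query.yahooapis.com/v1/public/yql');
    $endpoint->query_form( q => $yql, format => 'json' );

    my $res = LWP::UserAgent->new->get($endpoint);
    die $res->status_line unless $res->is_success;

    print $res->decoded_content, "\n";    # raw JSON, ready for JSON::XS or similar

From there, a module such as JSON::XS can turn the response into an ordinary Perl data structure.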

I wrote up my first experiment with it here:

http://www.kelvinluck.com/2009/02/data-scraping-with-yql-and-jquery/

Since then, YQL has become more powerful with the addition of the EXECUTE keyword, which allows you to write your own logic in JavaScript and run it on Yahoo's servers before the data is returned to you.

A more detailed write-up of YQL is here.

You could create a YQL data table describing the information you are trying to capture, and then the person collecting the data could write very simple queries (in a DSL which is pretty much English) against that table. That would be easier for them than "proper programming", at least...

+2

There is Sprog, which lets you graphically build a process out of parts (Get URL → Process HTML table → Write file), and you can drop Perl code into any stage of the process or write your own parts for a non-programmer to use. It looks a bit abandoned, but it still works well.

+2

I use a combination of Ruby with Hpricot and Watir; it does the job very efficiently.

0

If you don't mind it taking over your computer, and you need JavaScript support, WatiN is a pretty damn good browser automation tool. Written in C#, it has been very reliable for me in the past, providing a nice browser-independent wrapper for driving pages and pulling text out of them.

0

Are commercial tools viable answers? If so, take a look at http://screen-scraper.com/ . It is very simple to set up and use for scraping websites. They have a free version which is actually quite complete. And no, I'm not affiliated with the company :)

0
