Is there a module like Perl LWP for Ruby?

Perl has the LWP module:

The libwww-perl collection is a set of Perl modules that provides a simple and consistent application programming interface (API) to the World Wide Web. The main focus of the library is to provide classes and functions that allow you to write WWW clients. The library also contains modules of more general use, and even classes that help you implement simple HTTP servers.

Is there a similar module (gem) for Ruby?

Update

Here is an example of a function I wrote that retrieves URLs from a specific website.

    use LWP::UserAgent;
    use HTML::TreeBuilder 3;
    use HTML::TokeParser;

    sub get_gallery_urls {
        my $url = shift;

        my $ua = LWP::UserAgent->new;
        $ua->agent("$0/0.1 " . $ua->agent);
        $ua->agent("Mozilla/8.0");

        my $req = new HTTP::Request 'GET' => "$url";
        $req->header('Accept' => 'text/html');

        # send request
        $response_u = $ua->request($req);
        die "Error: ", $response_u->status_line unless $response_u->is_success;

        my $root = HTML::TreeBuilder->new;
        $root->parse($response_u->content);

        my @gu = $root->find_by_attribute("id", "thumbnails");
        my %urls = ();

        foreach my $g (@gu) {
            my @as = $g->find_by_tag_name('a');

            foreach $a (@as) {
                my $u = $a->attr("href");
                if ($u =~ /^\//) {
                    $urls{"http://example.com" . "$u"} = 1;
                }
            }
        }

        return %urls;
    }
4 answers

The closest match is probably httpclient, which aims to be the equivalent of LWP. However, depending on what you plan to do, there may be better options. If you plan to follow links, fill out forms, etc. in order to scrape web content, you can use Mechanize, which is similar to the Perl module of the same name. There are also more Ruby-ish gems, such as the excellent rest-client and HTTParty (my personal favorite). See the HTTP Clients category of the Ruby Toolbox for a larger list.
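To give a feel for those gems, here is a minimal sketch of a GET request with HTTParty (not from the original answer; it assumes the httparty gem is installed, and rest-client is similarly terse):

    require 'httparty'

    # Fetch a page and inspect the response
    response = HTTParty.get('http://example.com/')
    puts response.code   # HTTP status, e.g. 200
    puts response.body   # raw HTML of the page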

Update: Here's an example of how to find all the links on a page using Mechanize (Ruby, but the Perl equivalent would look similar):

    require 'rubygems'
    require 'mechanize'

    agent = Mechanize.new
    page = agent.get('http://example.com/')

    page.links.each do |link|
      puts link.text
    end
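Since the answer also mentions filling out forms, here is a rough Mechanize sketch of submitting a search form; the page URL, form name, and field name are made up for illustration:

    require 'mechanize'

    agent = Mechanize.new
    page  = agent.get('http://example.com/search')   # hypothetical page
    form  = page.form_with(name: 'search')           # hypothetical form name
    form['q'] = 'ruby'                                # hypothetical field name
    results = agent.submit(form)
    puts results.title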

P.S. As an ex-Perler, I used to worry about abandoning the excellent CPAN. Would I paint myself into a corner with Ruby? Would I be unable to find the equivalent of a module I rely on? This has turned out not to be a problem at all; in fact, lately it has been quite the opposite: Ruby (along with Python) tends to be the first to get client support for new platforms and web services.


Here is what your function might look like in Ruby.

    require 'rubygems'
    require 'mechanize'

    def get_gallery_urls(url)
      ua = Mechanize.new
      ua.user_agent = "Mozilla/8.0"
      urls = {}

      doc = ua.get(url)

      doc.search("#thumbnails a").each do |a|
        u = a["href"]
        urls["http://example.com#{u}"] = 1 if u =~ /^\//
      end

      urls
    end

Much nicer :)


I used Perl for years and years, and I loved LWP. It was a great tool. However, here is how I would go about extracting URLs from a page. This won't spider a site, but that would be easy enough to add:

    require 'open-uri'
    require 'uri'

    urls = URI.extract(open('http://example.com').read)
    puts urls

which results in:

    http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
    http://www.w3.org/1999/xhtml
    http://www.icann.org/
    mailto:iana@iana.org?subject=General%20website%20feedback
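As the output shows, URI.extract picks up everything that looks like a URI, including the DTD reference and the mailto link. If that is too noisy, it also accepts a list of schemes to keep; a small sketch, not part of the original answer:

    require 'open-uri'
    require 'uri'

    # Keep only http and https URLs
    urls = URI.extract(open('http://example.com').read, %w[http https])
    puts urls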

Writing this as a method:

    require 'open-uri'
    require 'uri'

    def get_gallery_urls(url)
      URI.extract(open(url).read)
    end

or, hewing closer to the original function while doing it the Ruby way:

    def get_gallery_urls(url)
      URI.extract(open(url).read).map { |u|
        URI.parse(u).host ? u : URI.join(url, u).to_s
      }
    end

or, following the original code even more closely:

    require 'nokogiri'
    require 'open-uri'
    require 'uri'

    def get_gallery_urls(url)
      Nokogiri::HTML(open(url))
        .at('#thumbnails')
        .search('a')
        .map { |link|
          href = link['href']
          URI.parse(href).host ? href : URI.join(url, href).to_s
        }
    end
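In the last two variants, URI.join is what turns relative hrefs into absolute URLs. A quick illustration with hypothetical paths:

    require 'uri'

    URI.join('http://example.com/gallery/', '/thumbs/1.jpg').to_s
    # => "http://example.com/thumbs/1.jpg"

    URI.join('http://example.com/gallery/', 'thumbs/1.jpg').to_s
    # => "http://example.com/gallery/thumbs/1.jpg"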

One of the things that drew me to Ruby is its ability to be readable while still being concise.

If you want to roll your own TCP/IP-based functionality, Ruby's Net standard library is the starting point. By default you get:

    net/ftp
    net/http
    net/imap
    net/pop
    net/smtp
    net/telnet

with the SSL-based ssh, scp, sftp and others available as gems. Use gem search net -r | grep ^net- to see a short list; a minimal net/http sketch follows below.
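For the simplest case, a bare net/http fetch looks something like this (a minimal sketch using only the standard library, not part of the original answer):

    require 'net/http'
    require 'uri'

    uri = URI('http://example.com/')
    response = Net::HTTP.get_response(uri)
    puts response.code   # => "200"
    puts response.body   # raw HTML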


This is geared more towards those looking at this question who need to know what the easier/better/different alternatives are for general web scraping with Perl, compared to using LWP (and even WWW::Mechanize).

There is a good selection of web scraping modules on CPAN, so please choose your favorite poison :)

For most of my recent web scraping I have been using pQuery. You can find quite a few examples of its use on SO.

The following is an example of get_gallery_urls using pQuery:

    use strict;
    use warnings;
    use pQuery;

    sub get_gallery_urls {
        my $url = shift;
        my %urls;

        pQuery($url)
            ->find("#thumbnails a")
            ->each( sub {
                my $u = $_->getAttribute('href');
                $urls{'http://example.com' . $u} = 1 if $u =~ /^\//;
            });

        return %urls;
    }

PS. As Daxim said in the comments, there are many excellent Perl web scraping tools. The hardest part is simply choosing which one to use!

