I used Perl for years and years and loved LWP. It was a great tool. However, here is how I'd extract the URLs from a page. This isn't spidering a site, but that would be easy:
    require 'open-uri'
    require 'uri'

    urls = URI.extract(open('http://example.com').read)
    puts urls
The result:
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
http://www.w3.org/1999/xhtml
http://www.icann.org/
mailto:iana@iana.org?subject=General%20website%20feedback
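As an aside, URI.extract also takes an optional array of schemes, so it's easy to keep only web links and drop entries like the mailto: above. A minimal sketch:

    require 'open-uri'
    require 'uri'

    # Limit extraction to http/https URLs; mailto: and other schemes are skipped.
    urls = URI.extract(open('http://example.com').read, ['http', 'https'])
    puts urls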
Writing that as a method:
    require 'open-uri'
    require 'uri'

    def get_gallery_urls(url)
      URI.extract(open(url).read)
    end
or, closer to the original function, and written more the Ruby way:
    def get_gallery_urls(url)
      URI.extract(open(url).read).map { |u|
        URI.parse(u).host ? u : URI.join(url, u).to_s
      }
    end
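The URI.join step is what turns host-less references into absolute URLs. As a standalone sketch, with made-up URLs:

    require 'uri'

    # Resolve a relative reference against a base URL.
    URI.join('http://example.com/gallery/', 'thumbs/1.jpg').to_s
    # => "http://example.com/gallery/thumbs/1.jpg"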
or, hewing closer to the original code:
    require 'nokogiri'
    require 'open-uri'
    require 'uri'

    def get_gallery_urls(url)
      Nokogiri::HTML(open(url))
        .at('#thumbnails')
        .search('a')
        .map { |link|
          href = link['href']
          # Absolute URLs pass through untouched; relative hrefs are
          # resolved against the page's URL.
          URI.parse(href).host ? href : URI.join(url, href).to_s
        }
    end
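For what it's worth, calling it would look something like this (the page URL is hypothetical; the #thumbnails selector comes from the code above):

    # A made-up gallery page containing <div id="thumbnails"> full of <a> tags.
    puts get_gallery_urls('http://example.com/gallery.html')

Note that on Ruby 3.0+, open-uri no longer patches Kernel#open, so these examples would use URI.open(url) instead of open(url).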
One of the things that attracted me to Ruby is its ability to be readable while still being concise.
If you want to roll your own TCP/IP-based functionality, Ruby's Net standard library is the starting point. By default, you get:
net/ftp
net/http
net/imap
net/pop
net/smtp
net/telnet
with the SSH-based ssh, scp and sftp, and others, available as gems. Use `gem search net -r | grep ^net-` to see a short list.
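If you do roll your own, a minimal GET using the bundled net/http looks like this (the URL is only an example):

    require 'net/http'
    require 'uri'

    # Fetch a page with the standard library's net/http.
    uri = URI.parse('http://example.com/')
    response = Net::HTTP.get_response(uri)

    puts response.code         # e.g. "200"
    puts response.body[0, 200] # first 200 bytes of the HTML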
the Tin Man