What are some examples of using Nokogiri?

Question

I am trying to understand Nokogiri. Does anyone have a link to a basic Nokogiri parsing or scraping example that shows the resulting tree? I think it would really help my understanding.

ruby nokogiri

user1094747 Dec 12 '11 at 23:43

1 answer

the tin man · Accepted Answer · 2011-12-13T00:10:05+0000

Using IRB and Ruby 1.9.2:

Load Nokogiri:

1.9.2-p290 :001 > require 'nokogiri'
 => true

Parse the document:

1.9.2-p290 :002 > doc = Nokogiri::HTML('<html><body><p>foobar</p></body></html>')
 => #<Nokogiri::HTML::Document:0x1012821a0 @node_cache = [], attr_accessor :errors = [], attr_reader :decorators = nil

Nokogiri likes well-formed documents. Note that it added the DOCTYPE because I parsed a whole document. It is also possible to parse a document fragment, but that is rather specialized.

1.9.2-p290 :003 > doc.to_html
 => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foobar</p></body></html>\n"

Search the document for the first <p> node using CSS and grab its content:

1.9.2-p290 :004 > doc.at('p').text
 => "foobar"

Use a different method name to do the same:

1.9.2-p290 :005 > doc.at('p').content
 => "foobar"

Search the document for all <p> nodes inside the <body> and grab the content of the first. search returns a NodeSet, which behaves like an array of nodes.

1.9.2-p290 :006 > doc.search('body p').first.text
 => "foobar"

Change the content of the node:

1.9.2-p290 :007 > doc.at('p').content = 'bar'
 => "bar"

Emit the parsed document as HTML:

1.9.2-p290 :008 > doc.to_html
 => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>bar</p></body></html>\n"

Remove the node:

1.9.2-p290 :009 > doc.at('p').remove
 => #<Nokogiri::XML::Element:0x80939178 name="p" children=[#<Nokogiri::XML::Text:0x8091a624 "bar">]>
1.9.2-p290 :010 > doc.to_html
 => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body></body></html>\n"

Regarding scraping, there are many questions on SO about using Nokogiri to pull apart HTML from sites. Searching StackOverflow for nokogiri and open-uri should help.
