How to navigate the DOM using Nokogiri

I am trying to populate the variables parent_element_h1 and parent_element_h2. Can someone help me use Nokogiri to get the information I need into these variables?

    require 'rubygems'
    require 'nokogiri'

    value = Nokogiri::HTML.parse(<<-HTML_END)
      <html>
        <body>
          <p id='para-1'>A</p>
          <div class='block' id='X1'>
            <h1>Foo</h1>
            <p id='para-2'>B</p>
          </div>
          <p id='para-3'>C</p>
          <h2>Bar</h2>
          <p id='para-4'>D</p>
          <p id='para-5'>E</p>
          <div class='block' id='X2'>
            <p id='para-6'>F</p>
          </div>
        </body>
      </html>
    HTML_END

    parent = value.css('body').first

    # start_here is given: a Nokogiri::XML::Element of the <div> with the id 'X2'
    start_here = parent.at('div.block#X2')

    # this should be a Nokogiri::XML::Element of the nearest, previous h1.
    # in this example it is the one with the value 'Foo'
    parent_element_h1 =

    # this should be a Nokogiri::XML::Element of the nearest, previous h2.
    # in this example it is the one with the value 'Bar'
    parent_element_h2 =

Note: the start_here element can be located anywhere inside the document. The HTML data is just an example. That said, the headers <h1> and <h2> may be a sibling of start_here or a child of a sibling of start_here.

The following recursive method is a good starting point, but it does not work for <h1> because that element is a child of a sibling of start_here:

    def search_element(_block, _style)
      unless _block.nil?
        if _block.name == _style
          return _block
        else
          search_element(_block.previous, _style)
        end
      else
        return false
      end
    end

    parent_element_h1 = search_element(start_here, 'h1')
    parent_element_h2 = search_element(start_here, 'h2')

After accepting an answer, I came up with my own solution. It works like a charm, and I think it's pretty cool.

+7
dom ruby ruby-on-rails xpath nokogiri
6 answers

I am coming to this a few years late, I suppose, but felt compelled to post because all the other solutions are far too complicated.

It takes a single statement with XPath:

    # doc is the parsed document (called value in the question)
    start = doc.at('div.block#X2')

    start.at_xpath('(preceding-sibling::h1 | preceding-sibling::*//h1)[last()]')
    #=> <h1>Foo</h1>

    start.at_xpath('(preceding-sibling::h2 | preceding-sibling::*//h2)[last()]')
    #=> <h2>Bar</h2>

This accommodates both direct previous siblings and children of previous siblings. Regardless of which one matches, the last() predicate ensures that you get the closest previous match.

+3

The approach I would take (if I understand your problem) is to use XPath or CSS to find your start_here element and the parent element you want to search under. Then recursively walk the tree starting at the parent, stopping when you hit the start_here element, and holding on to the last element that matches your style along the way.

Something like:

    parent = value.search("//body").first
    div = value.search("//div[@id = 'X2']").first

    find = FindPriorTo.new(div)

    assert_equal('Foo', find.find_from(parent, 'h1').text)
    assert_equal('Bar', find.find_from(parent, 'h2').text)

Where FindPriorTo is a simple class that handles the recursion:

    class FindPriorTo
      def initialize(stop_element)
        @stop_element = stop_element
      end

      def find_from(parent, style)
        @should_stop = nil
        @last_style = nil

        recursive_search(parent, style)
      end

      def recursive_search(parent, style)
        parent.children.each do |ch|
          recursive_search(ch, style)
          return @last_style if @should_stop

          @should_stop = (ch == @stop_element)
          @last_style = ch if ch.name == style
        end

        @last_style
      end
    end

If this approach does not scale well, you can optimize it by rewriting recursive_search without recursion, and by passing in both of the styles you are looking for and keeping track of the last one found for each, so that you do not have to traverse the tree a second time.

I would also suggest trying to monkey-patch Node to hook into the document while it is being parsed, but it looks like that is all written in C. You might be better served by something other than Nokogiri that has a pure-Ruby parser (possibly REXML), or, if speed is your real concern, by doing the search in C/C++ using Xerces or similar. I don't know how well those would handle parsing HTML, though.

+10

Maybe this will do it. I am not sure about the performance, and there may be some cases I have not thought of.

    def find(root, start, tag)
      ps, res = start, nil
      until res or (ps == root)
        ps  = ps.previous || ps.parent
        res = ps.css(tag).last
        res ||= ps.name == tag ? ps : nil
      end
      res || "Not found!"
    end

    parent_element_h1 = find(parent, start_here, 'h1')

+2

This is my own solution (kudos to my co-worker for helping me with this one!), using a recursive method that checks all elements regardless of whether they are a sibling or a child of another sibling.

    require 'rubygems'
    require 'nokogiri'

    value = Nokogiri::HTML.parse(<<-HTML_END)
      <html>
        <body>
          <p id='para-1'>A</p>
          <div class='block' id='X1'>
            <h1>Foo</h1>
            <p id='para-2'>B</p>
          </div>
          <p id='para-3'>C</p>
          <h2>Bar</h2>
          <p id='para-4'>D</p>
          <p id='para-5'>E</p>
          <div class='block' id='X2'>
            <p id='para-6'>F</p>
          </div>
        </body>
      </html>
    HTML_END

    parent = value.css('body').first

    # start_here is given: a Nokogiri::XML::Element of the <div> with the id 'X2'
    @start_here = parent.at('div.block#X2')

    # Search for parent elements of kind "_style" starting from _start_element
    def search_for_parent_element(_start_element, _style)
      unless _start_element.nil?
        # have we already found what we're looking for?
        if _start_element.name == _style
          return _start_element
        end

        # _start_element is a div.block and not the start_here element itself
        if _start_element[:class] == "block" && _start_element[:id] != @start_here[:id]
          # begin recursion with last child inside div.block
          from_child = search_for_parent_element(_start_element.children.last, _style)
          if from_child
            return from_child
          end
        end

        # begin recursion with previous element
        from_child = search_for_parent_element(_start_element.previous, _style)
        return from_child ? from_child : false
      else
        return false
      end
    end

    # this should be a Nokogiri::XML::Element of the nearest, previous h1.
    # in this example it is the one with the value 'Foo'
    puts parent_element_h1 = search_for_parent_element(@start_here, "h1")

    # this should be a Nokogiri::XML::Element of the nearest, previous h2.
    # in this example it is the one with the value 'Bar'
    puts parent_element_h2 = search_for_parent_element(@start_here, "h2")

You can copy and paste it and run it as a Ruby script as is.

0

If you do not know the relationship between the elements, you can search for them this way (anywhere in the document):

    # html code
    text = "insert your html here"

    # get doc object
    doc = Nokogiri::HTML(text)

    # get elements with the specified tag
    elements = doc.search("//your_tag")

If you need to submit a form, you should use Mechanize:

    require 'mechanize'

    # create mech object (Mechanize.new in Mechanize 1.0 and later)
    mech = WWW::Mechanize.new

    # load site
    mech.get("address")

    # select a form; in this case, I select the first form.
    # You can select the one you need from the array
    form = mech.page.forms.first

    # you fill the fields like this: form.name_of_the_field
    form.element_name = value
    form.other_element = other_value

-1

You can search the descendants of a Nokogiri::XML::Element using CSS selectors. You can traverse ancestors using the .parent method.

    parent_element_h1 = value.css("h1").first.parent
    parent_element_h2 = value.css("h2").first.parent

-1
