The approach I would take (if I understand your problem correctly) is to use XPath or CSS to find your start_here element and the parent element you want to search within. Then recursively traverse the tree starting at the parent, stopping when you hit the start_here element and keeping track of the last element that matches your style along the way.
Something like:
    parent = value.search("//body").first
    div = value.search("//div[@id = 'X2']").first

    find = FindPriorTo.new(div)

    assert_equal('Foo', find.find_from(parent, 'h1').text)
    assert_equal('Bar', find.find_from(parent, 'h2').text)
Where FindPriorTo is a simple class that handles the recursion:
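    class FindPriorTo
      def initialize(stop_element)
        @stop_element = stop_element
      end

      # Returns the last element named `style` that appears before the
      # stop element in document order, or nil if there is none.
      def find_from(parent, style)
        @should_stop = false
        @last_style = nil
        recursive_search(parent, style)
      end

      def recursive_search(parent, style)
        parent.children.each do |ch|
          # Check the stop element before descending, so that matches
          # nested inside it are not counted as coming before it.
          if ch == @stop_element
            @should_stop = true
            return @last_style
          end
          @last_style = ch if ch.name == style
          recursive_search(ch, style)
          return @last_style if @should_stop
        end
        @last_style
      end
    end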
If this approach does not scale well, you can optimize it by rewriting recursive_search so that it doesn't use recursion, and by passing in both of the styles you are looking for and keeping track of the last match for each, so you don't have to walk the tree a second time.
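A rough sketch of that optimization, in case it helps. The method name find_last_before and the DemoNode class are made up for illustration; neither is part of Nokogiri. The method only assumes that each node responds to name and children (which Nokogiri nodes do), and it uses an explicit stack instead of recursion while tracking every requested style in a single pass:

```ruby
# Hypothetical helper (not a Nokogiri API): walks the tree below `root` in
# document order without recursion and returns a hash mapping each requested
# style to the last matching node seen before `stop_element`.
def find_last_before(root, stop_element, styles)
  last_seen = {}
  # Reverse the children so the first child is popped first.
  stack = root.children.to_a.reverse
  until stack.empty?
    node = stack.pop
    break if node == stop_element
    last_seen[node.name] = node if styles.include?(node.name)
    stack.concat(node.children.to_a.reverse)
  end
  last_seen
end

# Stand-in node class for the demo only; real code would pass Nokogiri nodes.
class DemoNode
  attr_reader :name, :children

  def initialize(name, children = [])
    @name = name
    @children = children
  end
end

h1 = DemoNode.new('h1')
h2 = DemoNode.new('h2')
stop = DemoNode.new('div', [DemoNode.new('h1')]) # nested h1 must be ignored
body = DemoNode.new('body', [h1, DemoNode.new('p', [h2]), stop, DemoNode.new('h2')])

found = find_last_before(body, stop, %w[h1 h2])
```

With Nokogiri you would call it as find_last_before(parent, div, %w[h1 h2]) and read found['h1'] and found['h2'] from the result. A style that never appears before the stop element is simply missing from the hash.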
I would also have suggested monkey-patching Node to hook in while the document is being parsed, but it looks like that part is all written in C. You might be better off using something other than Nokogiri that is written in pure Ruby (possibly REXML), or, if speed is your real concern, doing the search in C/C++ using Xerces or similar. I don't know how well those would handle parsing HTML, though.
Aaron hinni