XML string search

I have an XML document with the following format:

    <document>
      <page>
        <column>
          <text>
            <par>
              <line></line>
            </par>
          </text>
        </column>
      </page>
    </document>

I want to search for a string in the XML, but the string may span several tags, several columns, and/or several pages:

    <document>
      <page>
        <column>
          <text>
            <par>
              <line>Hello</line>
            </par>
          </text>
        </column>
        <column>
          <text>
            <par>
              <line>World</line>
            </par>
          </text>
        </column>
      </page>
      <page>
        <column>
          <text>
            <par>
              <line>What's</line>
              <line>Up?</line>
            </par>
          </text>
        </column>
      </page>
    </document>

I need to find "Hello World What's Up?" and know that it is in line 1 of column 1, line 1 of column 2, and lines 1-2 of column 3 (page 2, column 1).

I have metadata on the lines that gives the line number and the number of the column the line belongs to, for example:

 <line linenum="1" columnnum="2">World</line> 

What is the best way to search for a phrase that spans different columns, and to find out which lines and columns it belongs to?

I can get all instances of the first word, iterate over each of them and see if the following words match the search words (word by word), and if there are no more words in this line, go to the next line. If there are no more lines, go to the next block. Thoughts?

Here is a real snippet of the XML and what the script returns:

    <block>
      <text>
        <par>
          <line colnum="1" linenum="1">
            (12) United States Patent
          </line>
        </par>
        <par>
          <line colnum="1" linenum="2">
            Kar-Roy et al.
          </line>
        </par>
      </text>
    </block>
    <block>
      <text>
        <par>
          <line colnum="2" linenum="3">
            US007078310B1
          </line>
        </par>
      </text>
    </block>
    <block>
      <text>
        <par>
          <line colnum="3" linenum="4">
            (io) Patent No.: US 7,078,310 B1
          </line>
        </par>
        <par>
          <line colnum="3" linenum="5">
            (45) Date of Patent: Jul. 18,2006
          </line>
        </par>
      </text>
    </block>
    <block>
      <text>
        <par>
          <line>
            (54) METHOD FOR FABRICATING A HIGH
          </line>
          <line>
            DENSITY COMPOSITE MIM CAPACITOR
          </line>
        </par>
      </text>
    </block>

When I search for "METHOD FOR HIGH-TISSUE", mapping the result with map{|f| f.text} returns:

 ["Kar-Roy et al.", "US007078310B1", "(io) Patent No.: US 7,078,310 B1", "(45) Date of Patent: Jul. 18,2006", "(54) METHOD FOR FABRICATING A HIGH"] 

It seems to use a length of five words and, for some reason, returns the four lines before the actual result.

2 answers

Here's my thought: first, parse your structure with an XML parser like Nokogiri, and use an XPath search to extract all the line elements. Then break each element into the words contained in that node, so that a phrase can match only part of a node. Keep the words in document order and use each_cons(4) (where 4 is the number of words you are looking for) to look at every run of four consecutive words, checking whether each run matches your search string. Here is my code for this:

    require 'nokogiri'

    xml = Nokogiri::XML.parse(doc) # doc holds the XML source
    search = "HIGH DENSITY"

    # 1. break down all the lines into words tagged with their nodes
    # 2. find matching subsequence
    # 3. build up from nodes
    nodes = xml.xpath('//line')
    words = nodes.map do |n|
      words_in_node = n.text.split(' ').map(&:upcase) # split into words and normalize
      words_in_node.map { |word| { word: word, node: n } }
    end
    words = words.flatten
    # at this point we have a single, ordered list like
    # [ {word: "foo", node: ...}, {word: "bar", node: ...} ]

    keywords = search.split(' ').map(&:upcase)
    result = words.each_cons(keywords.size).find do |sample|
      # Extract just the :word key from each hash, then compare to our search string
      sample_words = sample.map { |w| w[:word] }
      sample_words == keywords
    end

    if result
      puts "Found in these nodes:"
      puts result.map { |w| w[:node] }.uniq.inspect
      # you can find where each node was located via Nokogiri
    else
      puts "No match"
    end

Which produces:

    Found in these nodes:
    [#<Nokogiri::XML::Element:0x4ea323e name="line" children=[#<Nokogiri::XML::Text:0x4ea294c "\n (54) METHOD FOR FABRICATING A HIGH\n ">]>, #<Nokogiri::XML::Element:0x4ea3018 name="line" children=[#<Nokogiri::XML::Text:0x4ea2654 "\n DENSITY COMPOSITE MIM CAPACITOR\n ">]>]

If I understand what you want, I would do it like this:

    require 'nokogiri'

    doc = Nokogiri::XML(<<EOT)
    <document>
      <page>
        <column>
          <text>
            <par>
              <line linenum="1" columnnum="1">Hello</line>
            </par>
          </text>
        </column>
        <column>
          <text>
            <par>
              <line linenum="1" columnnum="2">World</line>
            </par>
          </text>
        </column>
      </page>
      <page>
        <column>
          <text>
            <par>
              <line linenum="1" columnnum="3">What's</line>
              <line linenum="2" columnnum="3">Up?</line>
            </par>
          </text>
        </column>
      </page>
    </document>
    EOT

    line_text = doc.search('column').map { |column|
      column.search('line').map { |line|
        {
          line: line['linenum'],
          column: line['columnnum'],
          text: line.text
        }
      }
    }

At this point, line_text contains:

    line_text
    # => [[{:line=>"1", :column=>"1", :text=>"Hello"}],
    #     [{:line=>"1", :column=>"2", :text=>"World"}],
    #     [{:line=>"1", :column=>"3", :text=>"What's"},
    #      {:line=>"2", :column=>"3", :text=>"Up?"}]]

This is grouped by <column>. The metadata attributes are not required, but they are convenient if they exist in the XML. If they don't, drop the lines that capture those attributes and just return the text:

    line_text = doc.search('column').map { |column|
      column.search('line').map { |line| line.text }
    }

    line_text
    # => [["Hello"], ["World"], ["What's", "Up?"]]

line_text is now an array of arrays. Each element of the outer array represents a column, and the elements of each inner array are that column's lines, so you can keep track of positions with a much smaller returned array and a small amount of additional code:

    line_text.each.with_index(1) do |column, column_num|
      column.each.with_index(1) do |text, line_num|
        puts "column: #{column_num} line: #{line_num} text: #{text}"
      end
    end
    # >> column: 1 line: 1 text: Hello
    # >> column: 2 line: 1 text: World
    # >> column: 3 line: 1 text: What's
    # >> column: 3 line: 2 text: Up?
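
To connect this back to the original question, here is a minimal sketch of how the positional array could be searched for a phrase that spans columns. It assumes the text-only form of line_text shown above, and the helper name locate_phrase is hypothetical, not part of Nokogiri. Matching is case-insensitive on whole whitespace-separated words, which is only one possible choice.

```ruby
# Text-only nested array: outer elements are columns, inner elements are lines.
line_text = [["Hello"], ["World"], ["What's", "Up?"]]

# Hypothetical helper: returns the 1-based [column, line] pairs that the
# phrase spans, or nil when the phrase does not occur.
def locate_phrase(line_text, phrase)
  # Tag every word with the column and line it came from, in document order.
  words = line_text.each_with_index.flat_map do |column, c|
    column.each_with_index.flat_map do |text, l|
      text.split.map { |w| { word: w.upcase, column: c + 1, line: l + 1 } }
    end
  end

  keywords = phrase.split.map(&:upcase)
  return nil if keywords.empty? # each_cons(0) would raise ArgumentError

  # Slide a window of keywords.size words over the flat list.
  hit = words.each_cons(keywords.size).find do |window|
    window.map { |w| w[:word] } == keywords
  end
  hit && hit.map { |w| [w[:column], w[:line]] }.uniq
end

p locate_phrase(line_text, "world what's up?")
# => [[2, 1], [3, 1], [3, 2]]
```

Because the positions come from array indexes rather than XML attributes, this works even when linenum and columnnum are missing, at the cost of losing the page or block the column sits in.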
