I have an XML document with the following format:
<document>
  <page>
    <column>
      <text>
        <par>
          <line></line>
        </par>
      </text>
    </column>
  </page>
</document>
I want to search the XML for a string that may span several lines, several columns, and/or several pages:
<document>
  <page>
    <column>
      <text>
        <par>
          <line>Hello</line>
        </par>
      </text>
    </column>
    <column>
      <text>
        <par>
          <line>World</line>
        </par>
      </text>
    </column>
  </page>
  <page>
    <column>
      <text>
        <par>
          <line>What's</line>
          <line>Up?</line>
        </par>
      </text>
    </column>
  </page>
</document>
I need to find "Hello World What Up" and know that it is in line 1 of column 1, line 1 of column 2, and lines 1-2 of column 3 (page 2, column 1).
Each line carries metadata giving its line number and the number of the column it belongs to, for example:
<line linenum="1" columnnum="2">World</line>
What is the best way to search for this phrase across different columns and find out which lines and columns the matched words belong to?
I can get all instances of the first word, iterate over each of them and see if the following words match the search words (word by word), and if there are no more words in this line, go to the next line. If there are no more lines, go to the next block. Thoughts?
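To make the idea concrete, here is a minimal sketch of a slightly simpler variant: instead of starting from each instance of the first word, it flattens every <line> into a list of words (each tagged with its line's metadata) and slides a window of the phrase's length over that list. It assumes the attributes are named colnum and linenum as in the real snippet, and it uses REXML only to stay self-contained; the names Token, tokenize and find_phrase are made up for illustration and are not from my actual script.

```ruby
require 'rexml/document'

# One word of the document plus the metadata of the <line> it came from.
Token = Struct.new(:word, :colnum, :linenum)

# Flatten every <line> element, in document order, into word tokens.
def tokenize(xml)
  doc = REXML::Document.new(xml)
  tokens = []
  REXML::XPath.each(doc, '//line') do |line|
    col = line.attributes['colnum']
    num = line.attributes['linenum']
    # to_s guards against empty <line></line> elements, whose text is nil.
    line.text.to_s.split.each { |w| tokens << Token.new(w, col, num) }
  end
  tokens
end

# Return one entry per match; each entry lists the [colnum, linenum]
# pairs of the lines that the matched words were found on.
def find_phrase(xml, phrase)
  words = phrase.split
  return [] if words.empty?

  tokenize(xml)
    .each_cons(words.size)
    .select { |window| window.map(&:word) == words }
    .map { |window| window.map { |t| [t.colnum, t.linenum] }.uniq }
end

xml = <<~XML
  <document>
    <page>
      <column><text><par><line colnum="1" linenum="1">Hello</line></par></text></column>
      <column><text><par><line colnum="2" linenum="1">World</line></par></text></column>
    </page>
    <page>
      <column><text><par>
        <line colnum="3" linenum="1">What's</line>
        <line colnum="3" linenum="2">Up?</line>
      </par></text></column>
    </page>
  </document>
XML

p find_phrase(xml, "Hello World What's Up?")
# prints [[["1", "1"], ["2", "1"], ["3", "1"], ["3", "2"]]]
```

The comparison here is exact, word for word, so punctuation and case have to match; normalizing the words before comparing would be a separate step.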
Here is a snippet of the real XML, followed by what my script returns:
<block>
  <text>
    <par>
      <line colnum="1" linenum="1"> (12) United States Patent </line>
    </par>
    <par>
      <line colnum="1" linenum="2"> Kar-Roy et al. </line>
    </par>
  </text>
</block>
<block>
  <text>
    <par>
      <line colnum="2" linenum="3"> US007078310B1 </line>
    </par>
  </text>
</block>
<block>
  <text>
    <par>
      <line colnum="3" linenum="4"> (io) Patent No.: US 7,078,310 B1 </line>
    </par>
    <par>
      <line colnum="3" linenum="5"> (45) Date of Patent: Jul. 18,2006 </line>
    </par>
  </text>
</block>
<block>
  <text>
    <par>
      <line> (54) METHOD FOR FABRICATING A HIGH </line>
      <line> DENSITY COMPOSITE MIM CAPACITOR </line>
    </par>
  </text>
</block>
When I search for "METHOD FOR HIGH-TISSUE", map{|f| f.text} returns:
["Kar-Roy et al.", "US007078310B1", "(io) Patent No.: US 7,078,310 B1", "(45) Date of Patent: Jul. 18,2006", "(54) METHOD FOR FABRICATING A HIGH"]
It seems to take a length of five words and, for some reason, returns the four lines before the actual result along with it.