How can I index into a string using its pre-processed indexes?

I have a definition string that can contain HTML, and an array of words. I am trying to find these words in the definition and return their start and end positions. For example, I might want to find "Hello" in:

 definition = "<strong>Hel</strong>lo World!" 

Getting rid of the HTML can be done using sanitize from ActionView and HTMLEntities, but this changes the index of "Hello" in the string, so:

 sanitized_definition.index("Hello") 

will return 0. I need the start position 8 and the end position 21. I was thinking of mapping the entire string to my own indexes, e.g.

 {"1" => '<', "2" => 's', "3" => 't', .. , "9" => 'H' ...} 

so 1 maps to the first character, 2 to the second, and so on, but I'm not sure what this accomplishes, and it seems too complicated. Does anyone have any ideas on how to do this?
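One way the index-map idea could be sketched: strip the tags while recording, for every character that is kept, where it sat in the original string. The helper names `strip_with_index_map` and `original_span` are hypothetical, and the sketch assumes that tags are simply anything between `<` and `>` (no entities, comments, or attribute values containing `>`).

```ruby
# Hypothetical helper: removes <...> tags and records, for every character
# that is kept, its index in the original string.
def strip_with_index_map(html)
  text = String.new
  map = []        # map[i] is the index in html of text[i]
  in_tag = false
  html.each_char.with_index do |ch, i|
    if ch == "<"
      in_tag = true
    elsif ch == ">" && in_tag
      in_tag = false
    elsif !in_tag
      text << ch
      map << i
    end
  end
  [text, map]
end

# Hypothetical helper: returns [start, end] (inclusive, in the original
# string) of the first occurrence of word in the stripped text, or nil.
def original_span(html, word)
  text, map = strip_with_index_map(html)
  pos = text.index(word)
  return nil unless pos
  [map[pos], map[pos + word.size - 1]]
end

original_span("<strong>Hel</strong>lo World!", "Hello")
  #=> [8, 21]
```

The search runs on the stripped text, and the map translates the result back, so the hash from the question reduces to a plain array of original indexes.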

EDIT:

A good point raised in the comments is that it doesn't make sense to include </strong> but not the <strong> at the beginning, partly because I hadn't worked out what to do with this edge case. For the purposes of this question, a better example might be something like:

 definition = "Probati<strong>onary Peri</strong>od."
 search_text = 'Probationary Period'

Also, having thought about this a little more, I think that in my particular case the only HTML construct I need to worry about is &nbsp;.
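If &nbsp; really is the only construct involved, a narrower sketch might let each space in the search text match either a literal space or the entity. The name `find_span_nbsp` is hypothetical, and the sketch assumes the entity always appears as the exact text `&nbsp;`.

```ruby
# Hypothetical helper: each space in word may match " " or "&nbsp;";
# every other character is matched literally.
def find_span_nbsp(str, word)
  parts = word.chars.map { |c| c == " " ? '(?: |&nbsp;)' : Regexp.escape(c) }
  m = Regexp.new(parts.join).match(str)
  # m.end(0) is one past the last matched character, hence the -1.
  m && [m.begin(0), m.end(0) - 1]
end

find_span_nbsp("Probationary&nbsp;Period.", "Probationary Period")
  #=> [0, 23]
```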

1 answer

I confess that I don't know much about HTML. I have assumed that each adjacent pair of letters of the target word (here "Hello") is separated by zero or more strings enclosed in angle brackets < and >, and nothing else (but I don't know if that is correct).

 def doit(str, word)
   r = Regexp.new(word.chars.join('(?:<.*?>)*'))
   ndx = str.index(r)
   ndx ? [ndx, ndx+str[r].size-1] : nil
 end

 doit "<strong>Hel</strong>lo World!", "Hello"
   #=> [8,21]

Here's what happens:

 str = "<strong>Hel</strong>lo World!"
 word = "Hello"

 a = word.chars
   #=> ["H", "e", "l", "l", "o"]
 s = a.join('(?:<.*?>)*')
   #=> "H(?:<.*?>)*e(?:<.*?>)*l(?:<.*?>)*l(?:<.*?>)*o"
 r = Regexp.new(s)
   #=> /H(?:<.*?>)*e(?:<.*?>)*l(?:<.*?>)*l(?:<.*?>)*o/
 ndx = str.index(r)
   #=> 8
 t = str[r]
   #=> "Hel</strong>lo"
 o = t.size-1
   #=> 13
 ndx ? [ndx, ndx+str[r].size-1] : nil
   #=> 8 ? [8, 8 + t.size-1] : nil
   #=> [8, 8 + 14 -1]
   #=> [8, 21]
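One caveat with doit is that characters such as "." or "?" in the word are treated as regex syntax. A possible variant, sketched under the same assumption about tags, escapes each character and reads both ends of the span from the MatchData instead of matching twice. The name `find_span` is hypothetical.

```ruby
# Hypothetical variant of doit: escapes each character of word and takes
# the span from a single match.
def find_span(str, word)
  gap = '(?:<[^>]*>)*'   # zero or more <...> tags between adjacent characters
  pattern = Regexp.new(word.chars.map { |c| Regexp.escape(c) }.join(gap))
  m = pattern.match(str)
  # m.end(0) is one past the last matched character, hence the -1.
  m && [m.begin(0), m.end(0) - 1]
end

find_span("<strong>Hel</strong>lo World!", "Hello")
  #=> [8, 21]
find_span("Probati<strong>onary Peri</strong>od.", "Probationary Period")
  #=> [0, 35]
```

With escaping in place, searching for "a.c" matches only a literal dot, so it no longer finds "axc".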