Regular Expression Analysis Using Nokogiri

Question

Using Nokogiri, I need to analyze this block:

<div class="some_class"> 12 AB / 4+ CD <br/> 2,600 Dollars <br/> </div>

I need to get the ab, cd and dollars values if they exist.

    ab = p.css(".some_class").text[....some regex....]
    cd = p.css(".some_class").text[....some regex....]
    dollars = p.css(".some_class").text[....some regex....]

Is this right? If so, can someone help me with a regex for parsing the values of ab, cd and dollars?

ruby regex nokogiri

There Are Four Lights Jul 17 '10 at 18:07

2 answers

This is an older thread, but I just stumbled upon it. Here is how I would find the values, along with a convenient way to store them:

    require "ap"
    require "nokogiri"

    xml = <<EOT
    <div class="some_class"> 12 AB / 4+ CD <br/> 2,600 Dollars <br/> </div>
    EOT

    doc = Nokogiri::XML(xml)
    some_class = doc.at('.some_class').text

    # scan returns [number, label] pairs; collect them into a hash keyed by label
    values = some_class
      .scan(/([\d+]+) ([a-z,]+)/i)
      .each_with_object({}){ |(v,c), h| h[c] = v.to_i }

    values # => {"AB"=>12, "CD"=>4, "Dollars"=>600}
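Note that the numeric class in the scan pattern above has no comma in it, so "2,600" is picked up as 600. If the full amount is wanted, one possible adjustment (a sketch, not the original poster's code) is to allow commas in the number, treat the trailing + as optional, and strip the commas before converting. The text literal is an assumed stand-in for what doc.at('.some_class').text returns for the sample markup.

```ruby
# Assumed stand-in for the text Nokogiri extracts from the sample <div>.
text = " 12 AB / 4+ CD  2,600 Dollars  "

values = text
  .scan(/([\d,]+)\+? ([a-z]+)/i)   # digits/commas, optional "+", a space, then a label
  .each_with_object({}) do |(number, label), h|
    h[label] = number.delete(",").to_i   # "2,600" -> 2600
  end

values # => {"AB"=>12, "CD"=>4, "Dollars"=>2600}
```

Whether dropping the "+" from "4+" is acceptable depends on what it means in your data; keep the string instead of calling to_i if it matters.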

the tin man Dec 20 '11 at 3:16

mikej · Accepted Answer · 2010-07-17T18:17:34+0000

To get the best answer, you will need to pin down exactly what format the AB, CD and Dollars values can take, but here is a solution based on the example above. It uses regexp grouping with () to capture the information we are interested in. (For details, see the bottom of the answer.)

    text = p.css(".some_class").text

    # one or more digits followed by a space followed by AB, capture the digits
    ab = text.match(/(\d+) AB/).captures[0] # => "12"

    # one or more digits followed by a literal + followed by CD
    cd = text.match(/(\d+\+) CD/).captures[0] # => "4+"

    # digits or commas followed by "Dollars"
    dollars = text.match(/([\d,]+) Dollars/).captures[0] # => "2,600"

Note that if there is no match, String#match returns nil, so if the values may not exist you will need to check for that, for example:

    if match = text.match(/([\d,]+) Dollars/)
      dollars = match.captures[0]
    end
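A shorter nil-safe variant, offered as a sketch: String#[] accepts a regexp plus a capture index and simply returns nil when nothing matches, which fits the text[...] form used in the question. The text literal below is an assumed stand-in for p.css(".some_class").text, and the Euros lookup is a made-up example of a value that is not present.

```ruby
# Assumed stand-in for p.css(".some_class").text on the sample markup.
text = " 12 AB / 4+ CD  2,600 Dollars  "

# text[regexp, 1] returns the first capture group, or nil if there is no match.
ab      = text[/(\d+) AB/, 1]          # => "12"
cd      = text[/(\d+\+) CD/, 1]        # => "4+"
dollars = text[/([\d,]+) Dollars/, 1]  # => "2,600"
euros   = text[/([\d,]+) Euros/, 1]    # => nil (hypothetical missing value)

# Convert to an integer only when the value was found.
dollars_value = dollars && dollars.delete(",").to_i  # => 2600
```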

Additional explanation of captures

To match the number of ABs, we need the pattern /\d+ AB/ to identify the correct part of the text. However, we are really only interested in the numeric part, so we surround it with parentheses so that we can extract it, e.g.:

    irb(main):027:0> match = text.match(/(\d+) AB/)
    => #<MatchData:0x2ca3440>
    # the match method returns MatchData if there is a match, nil if not
    irb(main):028:0> match.to_s
    # match.to_s gives us the entire text that matched the pattern
    => "12 AB"
    irb(main):029:0> match.captures
    => ["12"]
    # match.captures gives us an array of the parts of the pattern that were enclosed in ()
    # in our example there is just 1 but there could be multiple
    irb(main):030:0> match.captures[0]
    => "12"
    # the first capture - the bit we want

Take a look at the documentation for MatchData, in particular captures, for more details.
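To illustrate the "there could be multiple" case, here is a small sketch with two groups in one pattern. It uses named groups, which are an addition for illustration and not part of the answer above; the group names ab and cd and the text literal are assumptions, and the pattern only works if AB and CD always appear together in this order.

```ruby
# Assumed stand-in for p.css(".some_class").text on the sample markup.
text = " 12 AB / 4+ CD  2,600 Dollars  "

# Two named groups captured in a single match.
match = text.match(/(?<ab>\d+) AB \/ (?<cd>\d+\+?) CD/)

all_captures = match.captures  # => ["12", "4+"]
ab = match[:ab]                # => "12"
cd = match[:cd]                # => "4+"
```

As with a single group, match is nil when the pattern does not apply, so guard the lookups if the block may be missing.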
