Regular Expression Analysis Using Nokogiri
Using Nokogiri, I need to analyze this block:
    <div class="some_class"> 12 AB / 4+ CD <br/> 2,600 Dollars <br/> </div>

I need to get the ab, cd and dollars values if they exist.
    ab = p.css(".some_class").text[....some regex....]
    cd = p.css(".some_class").text[....some regex....]
    dollars = p.css(".some_class").text[....some regex....]

Is this right? If so, can someone help me with a regex for parsing the values of ab, cd and dollars?
To get the best answer, you will need to specify exactly what format the AB, CD and Dollars values can take, but here is a solution based on the example above. It uses regexp grouping with () to capture the information we are interested in. (For details, see the bottom of the answer.)
    text = p.css(".some_class").text

    # one or more digits followed by a space followed by AB, capture the digits
    ab = text.match(/(\d+) AB/).captures[0] # => "12"

    # one or more digits followed by a literal + followed by CD
    cd = text.match(/(\d+\+) CD/).captures[0] # => "4+"

    # digits or commas followed by "Dollars"
    dollars = text.match(/([\d,]+) Dollars/).captures[0] # => "2,600"

Note that if there is no match, String#match returns nil, so if the values may not exist you will need to check, for example:
    if match = text.match(/([\d,]+) Dollars/)
      dollars = match.captures[0]
    end

Additional explanation of captures
To match the number of ABs, we need the pattern /\d+ AB/ to identify the correct part of the text. However, we are only interested in the numerical part, so we surround it with parentheses so that we can extract it, e.g.:
    irb(main):027:0> match = text.match(/(\d+) AB/)
    => #<MatchData:0x2ca3440> # the match method returns MatchData if there is a match, nil if not
    irb(main):028:0> match.to_s # match.to_s gives us the entire text that matched the pattern
    => "12 AB"
    irb(main):029:0> match.captures
    => ["12"]
    # match.captures gives us an array of the parts of the pattern that were enclosed in ()
    # in our example there is just 1 but there could be multiple
    irb(main):030:0> match.captures[0]
    => "12" # the first capture - the bit we want

Take a look at the documentation for MatchData, in particular captures, for more details.
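The `text[...]` form from the question also works: `String#[]` accepts a regexp plus a capture index and returns nil when nothing matches, so no separate nil check on the match is needed. A minimal sketch, assuming the string literal below stands in for `p.css(".some_class").text` (the `sample` name and the "Euros" label are only for illustration):

```ruby
# Stand-in for p.css(".some_class").text (an assumed sample, not fetched via Nokogiri)
sample = " 12 AB / 4+ CD  2,600 Dollars  "

# String#[] with a regexp and a capture index returns that capture, or nil
ab      = sample[/(\d+) AB/, 1]
cd      = sample[/(\d+\+) CD/, 1]
dollars = sample[/([\d,]+) Dollars/, 1]

# A label that is absent from the text gives nil rather than raising
euros   = sample[/([\d,]+) Euros/, 1]
```

Each variable holds a String or nil, so any conversion (for example stripping the comma before `to_i`) still has to guard against nil.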
This is an older thread, but I just stumbled upon it. Here is how I would find the values, along with a convenient way to store them:
    require "nokogiri"

    xml = <<~EOT
      <div class="some_class"> 12 AB / 4+ CD <br/> 2,600 Dollars <br/> </div>
    EOT

    doc = Nokogiri::XML(xml)
    some_class = doc.at('.some_class').text

    # capture each "<number> <label>" pair; commas are allowed inside the number
    # and removed before converting, so "2,600" becomes 2600 and "4+" becomes 4
    values = some_class
      .scan(/([\d,+]+) ([a-z]+)/i)
      .each_with_object({}){ |(v,c), h| h[c] = v.delete(",").to_i }

    values # => {"AB"=>12, "CD"=>4, "Dollars"=>2600}
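Since the question asks for the values only if they exist, it may help to see how such a hash behaves when a label is missing. A rough sketch on plain strings, without Nokogiri: the `parse_values` helper and the sample strings are hypothetical, and stripping commas before converting is an assumption about the desired result.

```ruby
# Hypothetical helper: collect "<number> <label>" pairs from already-extracted text
def parse_values(text)
  text.scan(/([\d,+]+) ([a-z]+)/i).each_with_object({}) do |(value, label), h|
    h[label] = value.delete(",").to_i # "2,600" -> 2600, "4+" -> 4
  end
end

full    = parse_values(" 12 AB / 4+ CD  2,600 Dollars ")
partial = parse_values(" 12 AB ")

dollars     = full["Dollars"]        # present, so an Integer
missing     = partial["CD"]          # nil, because the label never appeared
with_default = partial.fetch("CD", 0) # fetch supplies a default instead of nil
```

Whether nil or a default is the better signal for a missing value depends on how the result is used downstream.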