
Regular Expression Analysis Using Nokogiri

Using Nokogiri, I need to parse this block:

<div class="some_class"> 12 AB / 4+ CD <br/> 2,600 Dollars <br/> </div> 

I need to get the ab, cd and dollars values, if they exist.

    ab = p.css(".some_class").text[....some regex....]
    cd = p.css(".some_class").text[....some regex....]
    dollars = p.css(".some_class").text[....some regex....]

Is this right? If so, can someone help me with a regex for parsing the values of ab, cd and dollars?

2 answers

To get the best answer, you will need to pin down exactly which formats the AB, CD and Dollars values can take, but here is a solution based on the example above. It uses regexp grouping with () to capture the information we are interested in. (For details, see the explanation at the bottom of this answer.)

    text = p.css(".some_class").text

    # one or more digits followed by a space followed by AB, capture the digits
    ab = text.match(/(\d+) AB/).captures[0] # => "12"

    # one or more digits followed by a literal + followed by a space and CD
    cd = text.match(/(\d+\+) CD/).captures[0] # => "4+"

    # digits or commas followed by a space and "Dollars"
    dollars = text.match(/([\d,]+) Dollars/).captures[0] # => "2,600"

Note that if there is no match, String#match returns nil, so if the values may not exist you will need to check for that first. For example:

    if (match = text.match(/([\d,]+) Dollars/))
      dollars = match.captures[0]
    end
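The question's original idea of indexing the string with a regex also works, and it handles the nil case for free. A minimal sketch, assuming the text has already been pulled out of the node (the string below stands in for p.css(".some_class").text, and the "Euros" label is a made-up example of a value that is absent):

```ruby
# Stand-in for p.css(".some_class").text (assumption: roughly this text comes back)
text = " 12 AB / 4+ CD  2,600 Dollars  "

# String#[] with a regexp and a capture index returns that capture,
# or nil when the pattern does not match at all.
ab      = text[/(\d+) AB/, 1]
cd      = text[/(\d+\+) CD/, 1]
dollars = text[/([\d,]+) Dollars/, 1]
euros   = text[/([\d,]+) Euros/, 1] # hypothetical label, not present in the text

p ab, cd, dollars, euros
```

Because a missing value simply comes back as nil, no explicit if around match is needed with this form.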

Additional explanation of captures

To match the number of ABs, we need the pattern /\d+ AB/ to identify the correct part of the text. However, we are really only interested in the numerical part, so we surround it with parentheses so that we can extract it. For example:

    irb(main):027:0> match = text.match(/(\d+) AB/)
    => #<MatchData:0x2ca3440>
    # the match method returns MatchData if there is a match, nil if not

    irb(main):028:0> match.to_s
    => "12 AB"
    # match.to_s gives us the entire text that matched the pattern

    irb(main):029:0> match.captures
    => ["12"]
    # match.captures gives us an array of the parts of the pattern that were enclosed in ()
    # in our example there is just 1 but there could be multiple

    irb(main):030:0> match.captures[0]
    => "12"
    # the first capture - the bit we want

Take a look at the documentation for MatchData, in particular captures, for more details.
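If positional indexes like captures[0] become hard to follow, the groups can be named instead. A small sketch on a plain string (the group names ab and cd are my own choice for illustration):

```ruby
text = "12 AB / 4+ CD 2,600 Dollars"

# (?<name>...) is a named group; the MatchData can then be indexed by name.
m = text.match(/(?<ab>\d+) AB/)
ab = m[:ab]

# A pattern with several groups yields several captures, in order.
both = text.match(/(?<ab>\d+) AB \/ (?<cd>\d+\+) CD/)
p both.captures        # array of the captured strings
p both.named_captures  # hash from group name to captured string
```

The same nil check applies here: match returns nil when the pattern does not match, so guard before indexing if a value may be missing.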


This is an older thread, but I just stumbled upon it. Here is how I would find the values, along with a convenient way to store them:

    require "nokogiri"

    xml = <<~EOT
      <div class="some_class">
        12 AB / 4+ CD
        <br/>
        2,600 Dollars
        <br/>
      </div>
    EOT

    doc = Nokogiri::XML(xml)

    some_class = doc.at('.some_class').text

    # scan returns [number, label] pairs; the comma is allowed in the number
    # so that "2,600" is captured whole, then stripped before to_i
    values = some_class
      .scan(/([\d,+]+) ([a-z]+)/i)
      .each_with_object({}) { |(v, c), h| h[c] = v.delete(",").to_i }

    values # => {"AB"=>12, "CD"=>4, "Dollars"=>2600}
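To see how the scan approach builds the hash, it helps to look at the intermediate array. This is only a sketch on a plain string, assuming the text has already been extracted from the node; the pattern is written to allow commas inside the number, and the variable names are mine:

```ruby
# Stand-in for the text of the .some_class node
text = "\n 12 AB / 4+ CD \n 2,600 Dollars \n"

# Each match contributes one [number, label] pair.
pairs = text.scan(/([\d,+]+) ([a-z]+)/i)
p pairs

# Strip thousands separators before converting; to_i ignores the trailing "+".
values = pairs.each_with_object({}) do |(number, label), hash|
  hash[label] = number.delete(",").to_i
end
p values
```

A label that never appears in the text just produces no pair, so the resulting hash has no key for it and no nil check is needed.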

Source: https://habr.com/ru/post/1316036/
