Ruby scan regex

Question


I am trying to split the string:

"[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

into the following array:

 [ ["test","blah"], ["foo","bar bar bar"], ["test","abc","123","456 789"] ]

I tried the following, but this is not entirely correct:

 "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]" .scan(/\[(.*?)\s*\|\s*(.*?)\]/) # => # [ # ["test", "blah"] # ["foo", "bar bar bar"] # ["test", "abc |123 | 456 789"] # ]

I need to split on every pipe, not just the first one. What regular expression would achieve this?

+4

ruby regex

Ryan king Mar 30 '13 at 13:48


4 answers

Two alternatives:

 s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

 s.split(/\s*\n\s*/).map{ |p| p.scan(/[^|\[\]]+/).map(&:strip) }
 #=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]

 s.split(/\s*\n\s*/).map do |line|
   line.sub(/^\s*\[\s*/,'').sub(/\s*\]\s*$/,'').split(/\s*\|\s*/)
 end
 #=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]

Both of them begin by splitting on line breaks (discarding the surrounding whitespace).

The first then scans each fragment for runs of characters that are not [ , | or ] , and trims the extra whitespace (it calls strip on each match).

The second instead removes the leading [ and the trailing ] (along with any whitespace), and then splits on | (along with any whitespace).

You cannot get the final result you want with a single scan. About the closest you can get is this:

 s.scan /\[(?:([^|\]]+)\|)*([^|\]]+)\]/
 #=> [["test", " blah"], ["foo ", "bar bar bar"], ["123 ", " 456 789"]]

... which loses information (only the last repetition of the group is kept), or this:

 s.scan /\[((?:[^|\]]+\|)*[^|\]]+)\]/
 #=> [["test| blah"], ["foo |bar bar bar"], ["test| abc |123 | 456 789"]]

... which captures the contents of each "array" as one capture, or this:

 s.scan /\[(?:([^|\]]+)\|)?(?:([^|\]]+)\|)?(?:([^|\]]+)\|)?([^|\]]+)\]/
 #=> [["test", nil, nil, " blah"], ["foo ", nil, nil, "bar bar bar"], ["test", " abc ", "123 ", " 456 789"]]

... which is hard-coded for a maximum of four elements and inserts nil entries that you would need to remove with .compact .

There is no way to use Ruby's scan with a regular expression such as /(?:(aaa)b)+/ and get a separate capture for each repetition of the group.
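To make that last point concrete, here is a small sketch. The sample string "1,2,3," and the variable names are mine, chosen only for illustration. A capture group inside a repeated group keeps only its final repetition, so the usual workaround is a second pass over each outer match.

```ruby
s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

# The group (\d) is repeated three times, but only the last capture survives.
last_only = "1,2,3,".scan(/(?:(\d),)+/)

# Scanning for the inner pattern on its own yields one match per repetition.
each_repetition = "1,2,3,".scan(/(\d),/).flatten

# Applied to the question: one scan for the brackets, a second pass for the pipes.
two_pass = s.scan(/\[([^\]]*)\]/).map { |m| m[0].split("|").map(&:strip) }

p last_only        # [["3"]]
p each_repetition  # ["1", "2", "3"]
p two_pass
```

The two-pass form is essentially what the split-based alternatives above do.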

+6

Phrogz Mar 30 '13 at 13:59


Why do it the hard way (with a single regex)? Why not a simple combination of splits? Here are the steps, written out to visualize the process.

 str = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

 arr = str.split("\n").map(&:strip)
 # => ["[test| blah]", "[foo |bar bar bar]", "[test| abc |123 | 456 789]"]

 arr = arr.map{|s| s[1..-2] }
 # => ["test| blah", "foo |bar bar bar", "test| abc |123 | 456 789"]

 arr = arr.map{|s| s.split('|').map(&:strip)}
 # => [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]

This is probably much less efficient than scan , but at least it is simple :)
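If the efficiency matters to you, it is easy to measure rather than guess. The following is only a rough sketch: time_runs is a hypothetical helper I made up for this, the iteration count is arbitrary, and the scan-based variant is just one possible comparison point. Timings will differ between machines and Ruby versions, so no numbers are claimed here.

```ruby
str = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

# Hypothetical helper: wall-clock seconds taken by n runs of the given block.
def time_runs(n)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  n.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# The split-only steps above, collapsed into one expression.
split_version = -> { str.split("\n").map { |s| s.strip[1..-2].split('|').map(&:strip) } }

# A scan-based variant for comparison.
scan_version = -> { str.scan(/\[(.*?)\]/).map { |m| m[0].split('|').map(&:strip) } }

split_time = time_runs(10_000) { split_version.call }
scan_time  = time_runs(10_000) { scan_version.call }

puts format("split: %.4fs  scan: %.4fs", split_time, scan_time)
```

Both lambdas return the same nested array for this input, so the comparison is like for like.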

+2

Sergio Tulentsev Mar 30 '13 at 14:03


"Scan, split, stall and delete" Train Damage

This whole premise seems erroneous because it assumes that you will always find alternation in your sub-arrays and that expressions will not contain character classes. However, if this is a problem that you really want to solve, then this should do it.

Firstly, str.scan( /\[.*?\]/ ) will contain three array elements, each of which contains pseudo-arrays. Then you match the subarrays, separating the interlace character. Each submatrix element is then stripped of blanks, and the square brackets are removed. For instance:

 str = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
 str.scan( /\[.*?\]/ ).map { |arr| arr.split('|').map { |m| m.strip.delete '[]' }}
 #=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]

Step-by-step detail

Mapping nested arrays is not always intuitive, so I unwound the train wreck above into more procedural code for comparison. The results are identical, but the following may be easier to reason about.

 string = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

 array_of_strings = string.scan( /\[.*?\]/ )
 #=> ["[test| blah]", "[foo |bar bar bar]", "[test| abc |123 | 456 789]"]

 sub_arrays = array_of_strings.map { |sub_array| sub_array.split('|') }
 #=> [["[test", " blah]"],
 #    ["[foo ", "bar bar bar]"],
 #    ["[test", " abc ", "123 ", " 456 789]"]]

 stripped_sub_arrays = sub_arrays.map { |sub_array| sub_array.map(&:strip) }
 #=> [["[test", "blah]"],
 #    ["[foo", "bar bar bar]"],
 #    ["[test", "abc", "123", "456 789]"]]

 sub_arrays_without_brackets = stripped_sub_arrays.map { |sub_array| sub_array.map {|elem| elem.delete '[]'} }
 #=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]

+2

Todd A. Jacobs Mar 30 '13 at 15:29


matt · Accepted Answer · 2013-03-30T14:02:18+0000

  s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
  arr = s.scan(/\[(.*?)\]/).map {|m| m[0].split(/ *\| */)}
  #=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
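One thing to be aware of: the split pattern only consumes spaces around the pipes, so padding directly inside the brackets would be kept. The question's input has none, so this does not matter there. The sketch below uses a made-up padded string to show the difference, and a variant (my assumption about what you might want, not part of the answer) that also absorbs whitespace next to the brackets.

```ruby
padded = "[ test | blah ]"

# Same approach as above: spaces next to the brackets survive.
as_written = padded.scan(/\[(.*?)\]/).map { |m| m[0].split(/ *\| */) }

# Variant: let the scan pattern consume whitespace just inside the brackets.
tolerant = padded.scan(/\[\s*(.*?)\s*\]/).map { |m| m[0].split(/\s*\|\s*/) }

p as_written  # [[" test", "blah "]]
p tolerant    # [["test", "blah"]]
```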
