Perl regex for extracting multi-line blocks

Question

I have the text as follows:

 00:00 stuff
 00:01 more stuff
 multi line
  and going
 00:02 still
 have

So there is no end-of-block marker; a block ends only where the next one starts.

I want to get all the blocks, one after another:

 1 = 00:00 stuff
 2 = 00:01 more stuff
 multi line
  and going

etc.

The code below gives me only this:

 $VAR1 = '00:00';
 $VAR2 = '';
 $VAR3 = '00:01';
 $VAR4 = '';
 $VAR5 = '00:02';
 $VAR6 = '';

What am I doing wrong?

 use Data::Dumper;

 my $text = '00:00 stuff
 00:01 more stuff
 multi line
  and going
 00:02 still
 have
     ';
 my @array = $text =~ m/^([0-9]{2}:[0-9]{2})(.*?)/gms;
 print Dumper(@array);

+8

regex perl

cristi May 14 '12 at 12:28

3 answers

Perl 5.10.0 introduced named capture groups, which are useful for matching non-trivial patterns.

(?'NAME'pattern)
(?<NAME>pattern)
A named capture group. It is identical in every respect to normal capturing parentheses (), except that the group can also be referred to by name in various regular expression constructs (such as \g{NAME}) and can be accessed by name after a successful match via %+ or %-. See perlvar for more details on the %+ and %- hashes.
If multiple distinct capture groups have the same name, then $+{NAME} refers to the leftmost defined group in the match.
The forms (?'NAME'pattern) and (?<NAME>pattern) are equivalent.
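
As a smaller illustration of the syntax quoted above, here is a minimal sketch of reading named captures through %+. The sample line and the variable names are made up for this example and do not come from the question.

```perl
use strict;
use warnings;
use 5.010;   # named captures and say() need Perl 5.10 or later

# Hypothetical single-line record, used only for this sketch
my $line = '00:01 more stuff';

my ($time, $desc);
if ($line =~ /^(?<time>[0-9]{2}:[0-9]{2})[ \t]+(?<desc>.*)$/) {
    # %+ holds the named captures of the last successful match
    ($time, $desc) = ($+{time}, $+{desc});
    say "time=[$time] desc=[$desc]";
}
```

The names time and desc mirror the ones used in the larger pattern, so the two can be compared side by side.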

Named capture groups let us name subpatterns within a regular expression and reuse them, as in the pattern below.

 use 5.10.0;  # named capture buffers

 my $block_pattern = qr/
   (?<time>(?&_time)) (?&_sp) (?<desc>(?&_desc))

   (?(DEFINE)
     # timestamp at logical beginning-of-line
     (?<_time> (?m:^) [0-9][0-9]:[0-9][0-9])

     # runs of spaces or tabs
     (?<_sp> [ \t]+)

     # description is everything through the end of the record
     (?<_desc>
       # s switch makes . match newline too
       (?s: .+?)

       # terminate before optional whitespace (which we remove) followed
       # by either end-of-string or the start of another block
       (?= (?&_sp)? (?: $ | (?&_time)))
     )
   )
 /x;

Use it as in

 my $text = '00:00 stuff
 00:01 more stuff
 multi line
  and going
 00:02 still
 have
     ';

 while ($text =~ /$block_pattern/g) {
   print "time=[$+{time}]\n",
         "desc=[[[\n",
         $+{desc},
         "]]]\n\n";
 }

Output:

 $ ./blocks-demo
 time=[00:00]
 desc=[[[
 stuff
 ]]]

 time=[00:01]
 desc=[[[
 more stuff
 multi line
  and going
 ]]]

 time=[00:02]
 desc=[[[
 still
 have
 ]]]

+4

Greg Bacon May 14 '12 at 13:26

Your problem is that .*? is non-greedy in the same way that .* is greedy. When nothing forces it to match more, it matches as little as possible, which in this case is the empty string.

So you need something after the non-greedy match to anchor your capture. I came up with this regex:
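
To see the effect in isolation, here is a minimal sketch on a single line. The variable names are made up for this example; the only difference between the two patterns is whether anything has to match after the lazy group.

```perl
use strict;
use warnings;

my $line = '00:00 stuff';

# Nothing follows (.*?), so matching zero characters already satisfies the pattern
my ($unforced) = $line =~ /^[0-9]{2}:[0-9]{2}(.*?)/;

# Here $ must match after the group, so (.*?) has to grow up to the end of the string
my ($forced) = $line =~ /^[0-9]{2}:[0-9]{2}(.*?)$/;

printf "unforced: [%s]\n", $unforced;   # prints "unforced: []"
printf "forced:   [%s]\n", $forced;     # prints "forced:   [ stuff]"
```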

 my @array = $text =~ m/\n?([0-9]{2}:[0-9]{2}.*?)(?=\n[0-9]{2}:|$)/gs;

As you can see, I removed the /m option so that $ in the look-ahead assertion matches only at the end of the string rather than at the end of every line.
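
The difference the /m option makes can be checked with a small sketch. The sample text here is a shortened stand-in (an assumption for this example, not the full text from the question), and the pattern is the one above, stored in a string so it can be run with both sets of flags.

```perl
use strict;
use warnings;

# Shortened sample text, assumed for this sketch
my $text = "00:00 stuff\n00:01 more stuff\nmulti line\n";

# Same pattern as above; single quotes keep the backslashes and $ for the regex engine
my $pattern = '\n?([0-9]{2}:[0-9]{2}.*?)(?=\n[0-9]{2}:|$)';

my @without_m = $text =~ /$pattern/gs;    # $ matches only at the end of the string
my @with_m    = $text =~ /$pattern/gms;   # $ also matches before every newline

print "without m: [$_]\n" for @without_m;
print "with m:    [$_]\n" for @with_m;
```

With /m, the second block stops at the end of its first line and the continuation line "multi line" is lost, which is why the flag has to go.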

You can also consider this solution:

 my @array = split /(?=[0-9]{2}:[0-9]{2})/, $text;
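
One thing to watch with the split approach: the look-ahead fires at anything that looks like a timestamp, even in the middle of a line. The following is a sketch of a possible variant that anchors the split point to the start of a line with ^ and /m. The sample text, including the embedded "10:30", is invented for this example.

```perl
use strict;
use warnings;

# Invented sample: the second block mentions a time in the middle of a line
my $text = "00:00 stuff\n00:01 meet at 10:30\nmulti line\n00:02 still\nhave\n";

# Splits before every timestamp-like substring, including "10:30"
my @loose = split /(?=[0-9]{2}:[0-9]{2})/, $text;

# Splits only where a timestamp begins a line
my @anchored = split /^(?=[0-9]{2}:[0-9]{2})/m, $text;

print scalar(@loose), " fields without the anchor\n";
print scalar(@anchored), " fields with the anchor\n";
```

If the data can never contain a timestamp inside a description, the unanchored version is sufficient.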

0

TLP May 14 '12 at 12:42

tuxuday · Accepted Answer · 2012-05-14T12:42:41+0000

This should do the trick. The start of the next \d\d:\d\d is treated as the end of the current block.

 $Str = '00:00 stuff
 00:01 more stuff
 multi line
  and going
 00:02 still
 have
 00:03 still
 have' ;
 @Blocks = ($Str =~ m#(\d\d:\d\d.+?(?:(?=\d\d:\d\d)|$))#gs);
 print join "--\n", @Blocks;
