Perl split function - use repeated characters as a delimiter

Question

I want to split the string using repeated letters as a separator, for example, "123aaaa23a3" should be split as ('123', '23a3') , while "123abc4" should be left unchanged.
So I tried this:

 @s = split /([[:alpha:]])\1+/, '123aaaa23a3';

But this returns '123', 'a', '23a3', which is not what I want. I now know this happens because one 'a' of 'aaaa' is captured by the parentheses and is therefore kept by split(). But I can't simply add something like ?:, since [[:alpha:]] must be captured for the backreference. How can I resolve this?
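For reference, the behaviour described above can be reproduced in isolation: when the split pattern contains a capturing group, the captured text is returned as an extra field between the ordinary ones. A minimal sketch (the comma example is only an illustration, not part of the original question):

```perl
use strict;
use warnings;

# Without a capture group, only the fields are returned.
my @plain = split /,/, 'a,b,c';            # ('a', 'b', 'c')

# With a capture group, each captured delimiter is returned as well.
my @with_delims = split /(,)/, 'a,b,c';    # ('a', ',', 'b', ',', 'c')

# The pattern from the question captures one letter per delimiter,
# so that letter appears between the two fields.
my @s = split /([[:alpha:]])\1+/, '123aaaa23a3';

print join('|', @s), "\n";                 # prints 123|a|23a3
```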

+7

regex perl

AaronS 21 sept '15 at 3:19


3 answers

One solution would be to use your original split call and throw away every other value. Conveniently, List::Util::pairkeys is a function that returns the first of each pair of values in its input list:

 use List::Util 1.29 qw( pairkeys );
 my @vals = pairkeys split /([[:alpha:]])\1+/, '123aaaa23a3';

gives

 Odd number of elements in pairkeys at (eval 6) line 1.
 [ '123', '23a3' ]

This warning occurs because pairkeys expects an even-sized list. We can solve this by adding another value to the end:

 my @vals = pairkeys split( /([[:alpha:]])\1+/, '123aaaa23a3' ), undef;

Alternatively, and perhaps a little more neatly, you can add the extra value at the start of the list and use pairvalues:

 use List::Util 1.29 qw( pairvalues );
 my @vals = pairvalues undef, split /([[:alpha:]])\1+/, '123aaaa23a3';

+2

Leonerd 21 sept '15 at 10:31


split can be made to work directly by using a postponed regular subexpression, (??{ code }), in the regular expression:

 @s = split /[[:alpha:]](??{"$&+"})/, '123aaaa23a3';

(??{ code }) is documented in the perlre manual page.

Note that, according to the perlvar man page, using $& anywhere in the program imposes a significant performance hit on all regular expression matches. I never found this to be a problem, but YMMV.

0

pjh Sep 28 '15 at 19:23


Sobrique · Accepted Answer · 2015-09-21T09:45:10+0000

Hmm, that's interesting. My first thought would be: your delimiter will always end up at an odd-numbered index, so you can simply discard the odd-numbered elements.

Something like this, maybe?:

 my %s = ( split( /([[:alpha:]])\1+/, '123aaaa23a3' ), '' );
 print Dumper \%s;

This will give you:

 $VAR1 = { '23a3' => '', '123' => 'a' };

So you can extract your fields via keys. (Note that a hash does not preserve the original order and collapses duplicate fields.)
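The "discard the odd-numbered elements" idea can also be sketched without a hash, using an array slice over the even indices. This keeps the fields in order and preserves duplicates. It is a sketch that assumes the pattern has exactly one capture group, so fields and captured letters strictly alternate:

```perl
use strict;
use warnings;

my @parts = split /([[:alpha:]])\1+/, '123aaaa23a3';

# Fields sit at even indices (0, 2, 4, ...); captured letters at odd ones.
my @fields = @parts[ grep { $_ % 2 == 0 } 0 .. $#parts ];

print join(', ', @fields), "\n";    # prints 123, 23a3
```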

Unfortunately, my second approach of "selecting" pattern matches through %+ doesn't particularly help (split doesn't populate the regex capture variables).

But something like this:

 my @delims = '123aaaa23a3' =~ m/(?<delim>[[:alpha:]])\g{delim}+/g;
 print Dumper \%+;

Using the named capture, we can identify that 'a' came from the capture group. Unfortunately, %+ doesn't seem to be populated when you do this via split, which leads to a two-pass approach.

This is the closest I got:

 #!/usr/bin/env perl
 use strict;
 use warnings;
 use Data::Dumper;

 my $str = '123aaaa23a3';

 # build a regex out of '2-or-more' characters.
 my $regex = join( "|", map { $_ . "{2,}" } $str =~ m/([[:alpha:]])\1+/g );

 # make the regex non-capturing
 $regex = qr/(?:$regex)/;
 print "Using: $regex\n";

 # split on the regex
 my @s = split m/$regex/, $str;
 print Dumper \@s;

First, we scan the string to extract the "2 or more" repeated characters to use as our delimiters. Then we assemble a regular expression from them using a non-capturing group, so that we can split on it.
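One edge case seems worth guarding in the two-pass approach: when no letter repeats (as in "123abc4"), the joined alternation is empty, and a pattern that matches the empty string makes split break the input into single characters. The sketch below wraps the same two passes in a helper that returns the string unchanged in that case. The name split_on_repeats is hypothetical, introduced here only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper; the name is an assumption, not from the answer.
sub split_on_repeats {
    my ($str) = @_;

    # Pass 1: collect each distinct letter that occurs twice or more in a row.
    my %seen;
    my @letters = grep { !$seen{$_}++ } $str =~ m/([[:alpha:]])\1+/g;

    # No repeated letters: return the input as-is rather than
    # splitting on an empty pattern.
    return ($str) unless @letters;

    # Pass 2: build a non-capturing alternation such as a{2,}|b{2,}.
    my $alternation = join '|', map { $_ . '{2,}' } @letters;
    return split /(?:$alternation)/, $str;
}

print join(', ', split_on_repeats('123aaaa23a3')), "\n";    # prints 123, 23a3
print join(', ', split_on_repeats('123abc4')), "\n";        # prints 123abc4
```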
