The mystery of regex

I am learning the secrets of regexps. I'm tired, so I may be missing something obvious, but I can see no reason for this behaviour.

In the examples below I use perl, but I first noticed this in Vim, so I assume it affects more than one regexp engine.

Suppose we have this file:

    $ cat data
    1 =2 3 =4
    5 =6 7 =8

Then we can remove the spaces before the '=' character with:

    $ cat data | perl -ne 's,(.)\s+=(.),\1=\2,g; print;'
    1=2 3=4
    5=6 7=8

Note that on each line, all instances of the match are replaced: we used the /g modifier, which does not stop at the first replacement but keeps replacing until the end of the line.

For example, both the space before '=2' and the space before '=4' were deleted, on the same line.

Why not use a simpler construct like 's, =,=,g'? Well, we were preparing for more complex scenarios, where the right-hand sides of the assignments are quoted strings that can use either single or double quotes:

    $ cat data2
    1 ="2" 3 ='4 ='
    5 ='6' 7 ="8"

To do the same job (remove the space before the equal sign), we must be careful, because the quoted strings themselves can contain an equal sign. Therefore, we remember the first quote that we see and look for it again through backreferences:

    $ cat data2 | perl -ne 's,(.)\s+=(.)([^\2]*)\2,\1=\2\3\2,g; print;'
    1="2" 3='4 ='
    5='6' 7="8"

We used the \2 backreference to search for anything that is not the same quote as the one we first saw, any number of times ([^\2]*). Then we searched for the original quote itself (\2). If found, we used backreferences to refer to the matched parts in the replacement.

Now look at this:

    $ cat data3
    posAndWidth ="40:5 =" height ="1"
    posAndWidth ="-1:8 ='" textAlignment ="Right"

What we want here is to remove the last space character before every instance of '=' on each line. As before, we cannot use the simple 's, =",=",g', because the quoted strings themselves may contain an equal sign.

So, we follow the same pattern as above and use backreferences:

    $ cat data3 | perl -ne "s,(\w+)(\s*) =(['\"])([^\3]*)\3,\1\2=\3\4\3,g; print;"
    posAndWidth="40:5 =" height ="1"
    posAndWidth="-1:8 ='" textAlignment ="Right"

It works ... but only for the first match on each line! The space following "textAlignment" was not deleted, and neither was the one on the line above it (after "height").

Basically, it seems that /g no longer works: running the same substitution without /g produces exactly the same result:

    $ cat data3 | perl -ne "s,(\w+)(\s*) =(['\"])([^\3]*)\3,\1\2=\3\4\3,; print;"
    posAndWidth="40:5 =" height ="1"
    posAndWidth="-1:8 ='" textAlignment ="Right"

It appears that /g is ignored in this regular expression. Any ideas why?

+4
2 answers

I will share my comment on the TLP answer:

ttsiodras, you ask two questions:

1- Why doesn't your regular expression give the desired result? Why does the g flag not work?

The answer is that your regular expression contains the part [^\3], which is not handled the way you expect: inside a character class, \3 is not recognized as a backreference. I searched but couldn't find a way to use a backreference inside a character class.

2- How do you remove the space preceding the equal sign while leaving alone whatever comes after it between quotation marks?

Here is one way to do it (see this link):

    $ cat data3 | perl -pe "s,(([\"']).*?\2)| (=),\1\3,g"
    posAndWidth="40:5 =" height ="1"
    posAndWidth="-1:8 ='" textAlignment="Right"

The first part of the regular expression catches everything between quotation marks (single or double) and replaces it with itself; the second part matches the equal sign preceded by the space you want to remove. Please note that this solution merely works around the "interesting" part, the negated character class with a backreference [^\3], by using the non-greedy operator *? instead.


Finally, if you want to pursue a negative lookahead solution:

    $ cat data3 | perl -pe 's,(\w+)(\s*) =(["'"'"'])((?:(?!\3).)*)\3,\1\2=\3\4\3,g'
    posAndWidth="40:5 =" height ="1"
    posAndWidth="-1:8 ='" textAlignment="Right"

The quoting between the square brackets still amounts to the class ["'], but I had to put single quotes around the entire perl command; otherwise the negative lookahead syntax (?!...) causes an error in bash, which tries to apply history expansion to the !.

EDIT: Fixed the regular expression with negative lookahead: note the non-greedy operator *? again, and the g flag.

EDIT: Took ttsiodras's comment into account: removed the non-greedy operator.

EDIT: Took TLP's comment into account.

+1

Inserting some debugging characters into your substitution sheds light on the problem:

    use strict;
    use warnings;

    while (<DATA>) {
        s,(\w+)(\s*) =(['"])([^\3]*)\3,$1$2=$3<$4>$3,g;
        print; #                        here -^ -^
    }

    __DATA__
    posAndWidth ="40:5 =" height ="1"
    posAndWidth ="-1:8 ='" textAlignment ="Right"

Output:

    posAndWidth="<40:5 =" height ="1>"
    posAndWidth="<-1:8 ='" textAlignment ="Right>"
    #            ^--------- match --------------^

Note that the match runs through both quoted strings at once. It would seem that [^\3]* does not do what you think.

Regex is not the best tool here. Use a parser that can handle quoted strings, for example Text::ParseWords:

    use strict;
    use warnings;
    use Data::Dumper;
    use Text::ParseWords;

    while (<DATA>) {
        chomp;
        my @a = quotewords('\s+', 1, $_);
        print Dumper \@a;
        print "@a\n";
    }

    __DATA__
    posAndWidth ="40:5 =" height ="1"
    posAndWidth ="-1:8 ='" textAlignment ="Right"

Output:

    $VAR1 = [
              'posAndWidth',
              '="40:5 ="',
              'height',
              '="1"'
            ];
    posAndWidth ="40:5 =" height ="1"
    $VAR1 = [
              'posAndWidth',
              '="-1:8 =\'"',
              'textAlignment',
              '="Right"'
            ];
    posAndWidth ="-1:8 ='" textAlignment ="Right"

I have included the Dumper output so you can see how the lines are split.

+3