Understanding Perl /m and /s Regular Expression Modifiers

Question

I have been reading about Perl regexes with the s, m, and g modifiers. I understand that //g is a global match, where it will be a greedy search.

But the s and m modifiers confuse me. Can someone explain the difference between s and m, with a code example showing how they behave differently? I searched the web, but it only gives explanations like the one at http://perldoc.perl.org/perlre.html#Modifiers . On Stack Overflow I have even seen people use s and m together. Isn't s the opposite of m?

//s //m //g

I cannot match multiple lines using m.

    use warnings;
    use strict;
    use 5.012;

    my $file;
    {
        local $/ = undef;
        $file = <DATA>;
    };
    my @strings = $file =~ /".*"/mg; #returns all except the last string across multiple lines
    #/"String"/mg; tried with this as well and returns nothing except String
    say for @strings;

    __DATA__
    "This is string"
    "1!=2"
    "This is \"string\""
    "string1"."string2"
    "String"
    "S
    t
    r
    i
    n
    g"

+7

regex perl

user2763829 Apr 9 '14 at 12:29

4 answers

With /".*"/mg your match

starts with "
then .*" matches as many characters as possible (any character except \n ) up to a "
since you use /g, once the match stops at the second " the regex tries to repeat the first two steps from that position
/m makes no difference here, since you are not using the ^ or $ anchors

Since you have escaped quotes in your example, a regex is not the best tool for what you want. If that were not the case, and you needed everything between two quotation marks, /".*?"/gs would do the job.
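
A minimal sketch of the difference between the two patterns mentioned above. The sample string in $text is made up for illustration and is not the asker's data:

```perl
use strict;
use warnings;

# Hypothetical sample: two quoted strings on the first line,
# then one quoted string that is split across two lines.
my $text = qq{"a" and "b"\n"c\nd"\n};

# Greedy, no /s: . cannot cross a newline, and .* runs to the LAST quote on the line.
my @greedy = $text =~ /".*"/mg;

# Non-greedy with /s: . may cross newlines, and .*? stops at the NEXT quote.
my @lazy = $text =~ /".*?"/gs;

print "greedy: <$_>\n" for @greedy;
print "lazy:   <$_>\n" for @lazy;
```

The greedy pattern returns the whole first line as a single match and never finds the split string, while the non-greedy /s version returns each quoted string separately, including the one that spans two lines.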

+4

Dry27 Apr 9 '14 at 13:31

/m and /s affect how the matching operator processes multi-line strings.

With the /m modifier, ^ and $ match at the beginning and end of any line within the string. Without the /m modifier, ^ and $ match only at the beginning and end of the string (with $ also allowed just before a trailing newline).

Example:

    $_ = "foo\nbar\n";
    /foo$/, /^bar/      do not match
    /foo$/m, /^bar/m    match

With the /s modifier, the special character . matches any character, including newlines. Without the /s modifier, . matches any character except a newline.

    $_ = "cat\ndog\ngoldfish";
    /cat.*fish/     does not match
    /cat.*fish/s    matches

You can use the /s and /m modifiers together.

    $_ = "100\n101\n102\n103\n104\n105\n";
    /^102.*104$/      does not match
    /^102.*104$/s     does not match
    /^102.*104$/m     does not match
    /^102.*104$/sm    matches
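
The last table can be checked with a short script. This is only a sketch: the variable names ($s, @patterns, @matched) are mine, and the patterns are the four from the example above.

```perl
use strict;
use warnings;

my $s = "100\n101\n102\n103\n104\n105\n";

# Label and compiled pattern for each modifier combination.
my @patterns = (
    [ '/^102.*104$/',   qr/^102.*104$/   ],
    [ '/^102.*104$/s',  qr/^102.*104$/s  ],
    [ '/^102.*104$/m',  qr/^102.*104$/m  ],
    [ '/^102.*104$/sm', qr/^102.*104$/sm ],
);

my @matched;
for my $p (@patterns) {
    my ($label, $re) = @$p;
    my $ok = $s =~ $re ? 1 : 0;   # force a plain 1/0 result
    push @matched, $ok;
    printf "%-16s %s\n", $label, $ok ? 'matches' : 'does not match';
}
```

Only the /sm version succeeds: /m is needed so that ^ and $ can anchor at the line boundaries around 102 and 104, and /s is needed so that .* can cross the newlines in between.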

+3

mob Apr 9 '14 at 14:36

Borodin's regular expression will work for examples from this lab.

However, it is also possible for a backslash to escape itself. This happens, for example, when a string contains Windows paths, so the following regex handles that case as well:

    use warnings;
    use strict;
    use 5.012;

    my $file = do { local $/; <DATA> };

    my @strings = $file =~ /"(?:(?>[^"\\]+)|\\.)*"/g;

    say "<$_>" for @strings;

    __DATA__
    "This is string"
    "1!=2"
    "This is \"string\""
    "string1"."string2"
    "String"
    "S
    t
    r
    i
    n
    g"
    "C:\\windows\\style\\path\\"
    "another string"

Outputs:

    <"This is string">
    <"1!=2">
    <"This is \"string\"">
    <"string1">
    <"string2">
    <"String">
    <"S
    t
    r
    i
    n
    g">
    <"C:\\windows\\style\\path\\">
    <"another string">

For a quick explanation of the pattern:

    my @strings = $file =~ m{
        "
        (?:
            (?>             # Independent subexpression (reduces backtracking)
                [^"\\]+     # Gobble all non double quotes and backslashes
            )
        |
            \\.             # Backslash followed by any character
        )*
        "
    }xg;                    # /x modifier allows whitespace and comments.
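
To see why the self-escaping backslash matters, here is a small comparison of this pattern with the simpler alternation /"(?:\\"|[^"])*"/g from the accepted answer. The input in $text is a made-up example, and the array names are mine:

```perl
use strict;
use warnings;

# Hypothetical input: the first string ends with an escaped backslash (\\).
# A single-quoted heredoc keeps every backslash literally.
my $text = <<'END';
"C:\\path\\" "next"
END

# Simpler alternation: either \" or any non-quote character.
my @simple = $text =~ /"(?:\\"|[^"])*"/g;

# Pattern from this answer: a backslash always consumes the character after it.
my @paired = $text =~ /"(?:(?>[^"\\]+)|\\.)*"/g;

print "simple: <$_>\n" for @simple;
print "paired: <$_>\n" for @paired;
```

The simpler pattern reads the final \" of the path as an escaped quote, so its single match runs on into the opening quote of the next string. The pattern from this answer consumes backslashes in pairs and returns the two strings separately.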

+1

Miller Apr 9 '14 at 18:39

Borodin · Accepted Answer · 2014-04-09T12:35:59+0000

The documentation that you link to seems very clear to me. It would help if you explained what you had trouble understanding, and how you came to the conclusion that /s and /m are opposites.

In short, /s changes the behavior of the dot metacharacter . so that it matches any character at all. Normally it matches anything except a newline "\n" , and so /s treats the string as a single line even if it contains newlines.

/m changes the caret ^ and dollar $ metacharacters so that they also match at newlines within the string, treating it as a multi-line string. Normally they match only at the beginning and end of the string.

You should not think of the /g modifier as "greedy". It is for global matches, which find all occurrences of the pattern within the string. The term "greedy" is normally used for the behavior of quantifiers within a pattern. For example, .* is called greedy because it will match as many characters as possible, unlike .*? which will match as few characters as possible.
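
A short sketch that separates the two ideas: /g controls how many matches are returned, while greediness controls how long each match is. The sample string and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $text = 'x=<a><b><c>';   # hypothetical sample string

# No /g: only the first match; a capture group returns the matched text.
my ($greedy_once) = $text =~ /(<.*>)/;    # greedy quantifier
my ($lazy_once)   = $text =~ /(<.*?>)/;   # non-greedy quantifier

# With /g in list context: every match in the string.
my @greedy_all = $text =~ /<.*>/g;        # one long match
my @lazy_all   = $text =~ /<.*?>/g;       # three short matches

print "greedy, no /g: $greedy_once\n";
print "lazy,   no /g: $lazy_once\n";
print "greedy, /g:    @greedy_all\n";
print "lazy,   /g:    @lazy_all\n";
```

Adding /g to the greedy pattern still yields a single match, because the greedy .* has already consumed everything up to the last > in the string.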

Update

In your modified question you use /".*"/mg , in which /m is irrelevant because, as noted above, that modifier changes only the behavior of the $ and ^ metacharacters, and there are none in your pattern.

Changing it to /".*"/sg improves things a little, in that . can now match the newline at the end of each line, and so the pattern can match multi-line strings. (Note that it is the subject string that is considered to be "single line" here. That is, the match behaves as if there were no newlines in it, as far as . is concerned.) However, this is where the conventional meaning of greedy comes in, because the pattern now matches everything from the first double quote in the first line to the last double quote at the end of the last line. I assume that is not what you want.

There are several ways to fix this. I recommend changing your pattern so that the string you want is a double quote, followed by any sequence of characters except double quotes, followed by another double quote. This is written /"[^"]*"/g (note that the /s modifier is no longer needed, since there are no dots in the pattern) and it very nearly does what you want, except that escaped double quotes are treated as the end of the string.

Take a look at this program and its output, noting that I have put a chevron >> at the start of each match so that they can be distinguished
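
A minimal sketch of that effect, assuming a made-up three-line input rather than the data from the question:

```perl
use strict;
use warnings;

# Hypothetical input: one quoted string per line.
my $text = qq{"one"\n"two"\n"three"\n};

# Without /s the dot stops at each newline, so every line matches on its own.
my @per_line = $text =~ /".*"/mg;

# With /s the greedy .* runs across the newlines to the very last quote.
my @spanning = $text =~ /".*"/sg;

print scalar(@per_line), " match(es) without /s\n";
print scalar(@spanning), " match(es) with /s\n";
```

With /s there is a single match that starts at the first quote of "one" and ends at the last quote of "three".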

    use strict;
    use warnings;

    my $file = do {
        local $/;
        <DATA>;
    };

    my @strings = $file =~ /"[^"]*"/g;

    print ">> $_\n\n" for @strings;

    __DATA__
    "This is string"
    "1!=2"
    "This is \"string\""
    "string1"."string2"
    "String"
    "S
    t
    r
    i
    n
    g"

Output

    >> "This is string"

    >> "1!=2"

    >> "This is \"

    >> ""

    >> "string1"

    >> "string2"

    >> "String"

    >> "S
    t
    r
    i
    n
    g"

As you can see, everything is now in order, except that in "This is \"string\"" it has found two matches, "This is \" and "" . Fixing that may be more complicated than you want, but it is perfectly possible. Please say so if you need that fixed too.

Update

I may as well finish this off. To ignore escaped double quotes and treat them as part of the string, we need to accept either \" or any character other than a double quote. This is done using the regex alternation operator | and it must be grouped inside non-capturing parentheses (?: ... ) . The end result is /"(?:\\"|[^"])*"/g (the backslash itself must be escaped, so it is doubled) which, when put into the above program, produces this output, which I assume is what you wanted.

    >> "This is string"

    >> "1!=2"

    >> "This is \"string\""

    >> "string1"

    >> "string2"

    >> "String"

    >> "S
    t
    r
    i
    n
    g"
