To double escape or not to double escape in PHP PCRE functions?

I was looking for a solid article on when double escaping is needed and when it is not, but I couldn't find anything. Maybe I didn't look hard enough, because I'm sure there is an explanation somewhere, but let's make it easy to find for the next guy who has this question!

Take, for example, the following regex patterns:

 /\n/
 /domain\.com/
 /myfeet \$ your feet/

Nothing special so far? OK, let's use these examples in the context of the PHP function preg_match:

 $foo = preg_match("/\n/", $bar);
 $foo = preg_match("/domain\.com/", $bar);
 $foo = preg_match("/myfeet \$ your feet/", $bar);

As far as I understand, a backslash inside a quoted string value escapes the next character, and the expression is then passed on using the resulting string value.

So wouldn't the code above effectively do the following, and wouldn't that cause errors?

 $foo = preg_match("/n/", $bar);
 $foo = preg_match("/domain.com/", $bar);
 $foo = preg_match("/myfeet $ your feet/", $bar);

That is not what I want: these expressions are not the same as the ones listed above.

Wouldn't I have to double escape them, like this?

 $foo = preg_match("/\\n/", $bar);
 $foo = preg_match("/domain\\.com/", $bar);
 $foo = preg_match("/myfeet \\$ your feet/", $bar);

So that when PHP processes the string, it reduces the escaped backslash to a single backslash, which then survives when the string is passed to the PCRE interpreter?

Or does PHP just magically know that I want to pass that backslash on to the PCRE interpreter? I mean, how does it know that I'm not trying to \" escape a quote that I want to use in my expression? Or are double backslashes only required when using an escaped quote? And for that matter, would I need a TRIPLE escape to escape a quote? You know, \\\" so that the quote is escaped but a double backslash remains?

What is the rule of thumb?

I just tested this in PHP:

 $bar = "asdfasdf a\"ONE\"sfda dsf adsf me $ mine adsf asdf asfd ";
 echo preg_match("/me \$ mine/", $bar);
 echo "<br /><br />";
 echo preg_match("/me \\$ mine/", $bar);
 echo "<br /><br />";
 echo preg_match("/a\"ONE\"/", $bar);
 echo "<br /><br />";
 echo preg_match("/a\\\"ONE\\\"/", $bar);
 echo "<br /><br />";

Output:

 0 1 1 1 

So it looks like it really doesn't matter for quotes, but the double escape is required for the dollar sign, as I thought.

5 answers

Double quotes

When it comes to escaping inside double quotes, the rule is that PHP inspects the character immediately after the backslash.

If that character is in the set ntrvef\$" or a numeric value follows (the rules can be found in the PHP manual), the sequence is evaluated as the corresponding control character or as the ordinal (hexadecimal or octal) representation, respectively.

It is important to note that if an invalid escape sequence is specified, the sequence is not evaluated, and both the backslash and the character remain. This differs from some other languages, where an invalid escape sequence causes an error.

E.g. "domain\.com" is left as is.

Note that variables are also expanded inside double quotes, so e.g. "$var" must be escaped as "\$var" to keep it literal.
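The double-quote rules above can be checked by looking at what PHP actually stores in each string. This is only a minimal sketch; the variable names and sample values are made up for illustration.

```php
<?php
// "\n" is a recognized escape: PHP stores a single newline character.
$newline = "\n";

// "\." is not a recognized escape: backslash and dot are both kept.
$dot = "domain\.com";

// "\$" is a recognized escape: it yields a literal dollar sign and
// stops PHP from interpolating $var.
$var = 'something';
$escaped = "\$var";
$interpolated = "$var";

echo strlen($newline), "\n";   // 1
echo $dot, "\n";               // domain\.com
echo $escaped, "\n";           // $var
echo $interpolated, "\n";      // something
```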

Single quotes

Since PHP 5.1.1, any backslash inside single quotes (followed by at least one character) is printed as is, and no variables are substituted. The only exceptions are \\ and \', which yield a backslash and a single quote. This is by far the most convenient feature of single quotes.
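A small sketch of the single-quote behaviour, under the assumption that the only sequences PHP rewrites inside single quotes are \\ and \'. The variable names are illustrative.

```php
<?php
$var = 'something';

// No escape sequences: backslash and "n" stay two separate characters.
$raw = '\n';

// No interpolation: the dollar sign and the name are kept literally.
$literal = '$var';

// The two exceptions: \\ yields one backslash and \' yields a quote.
$backslash = '\\';
$quote = '\'';

echo strlen($raw), "\n";        // 2
echo $literal, "\n";            // $var
echo strlen($backslash), "\n";  // 1
echo $quote, "\n";              // '
```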

Regular expressions

To escape text for use in regular expressions, it is best to leave the escaping to preg_quote():

 $foo = preg_match('/' . preg_quote('mine & yours', '/') . '/', $bar); 

This way you don't have to worry about which characters you need to escape, so it works well for user input.
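As a rough illustration of the user-input case, the sketch below quotes a made-up input string that contains several metacharacters as well as the delimiter. The input and subject strings are hypothetical.

```php
<?php
// Hypothetical user input that contains regex metacharacters
// and the delimiter itself.
$input = 'price: $5.00 (a/b)';

// The second argument tells preg_quote() to escape the delimiter too.
$pattern = '/' . preg_quote($input, '/') . '/';

// The quoted pattern only matches the input text literally.
$hit = preg_match($pattern, 'Total price: $5.00 (a/b) today');
$miss = preg_match($pattern, 'Total price: $5x00 (a/b) today');

echo $pattern, "\n";
echo $hit, "\n";   // 1
echo $miss, "\n";  // 0
```

Single quotes are used throughout so that PHP itself leaves the dollar sign alone and only preg_quote() adds backslashes.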

See also: preg_quote

Update

You have added this test:

 "/me \$ mine/" 

This evaluates to "/me $ mine/"; but in PCRE the $ has a special meaning (it is the end-of-subject anchor).

 "/me \\$ mine/" 

This evaluates to "/me \$ mine/", so the backslash is escaped for PHP itself and the $ for PCRE. This only works by accident, because the $ happens not to be followed by a variable name.

 $var = 'something';
 "/me \\$var mine/"

This evaluates to "/me \something mine/", so you need to escape the $ as well:

 "/me \\\$var mine/" 
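The three variants with a variable name after the dollar sign can be compared side by side. This is a minimal sketch that reuses the $var value from the answer above; the other variable names are illustrative.

```php
<?php
$var = 'something';

// One backslash: "\$" is a PHP escape, so PCRE sees a bare "$" anchor
// followed by "var".
$one = "/me \$var mine/";

// Two backslashes: "\\" collapses to one backslash, then $var is
// interpolated by PHP.
$two = "/me \\$var mine/";

// Three backslashes: "\\" gives a backslash and "\$" gives a dollar sign,
// so PCRE receives an escaped, literal "$".
$three = "/me \\\$var mine/";

echo $one, "\n";    // /me $var mine/
echo $two, "\n";    // /me \something mine/
echo $three, "\n";  // /me \$var mine/

echo preg_match($three, 'me $var mine'), "\n";  // 1
```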

Use single quotes. They prevent escape sequences.

For instance:

 php > print "hi\n";
 hi
 php > print 'hi\n';
 hi\nphp >

Whenever you have an invalid escape sequence, PHP actually leaves the characters literally in the string. From the docs:

As in single quoted strings, escaping any other character will result in the backslash being printed too.

I.e. "\&" is really interpreted as "\&". There are not that many escape sequences, so in most cases you can probably get away with a single backslash. But for consistency, escaping the backslash may be the better choice.

As always: know what you are doing :)


OK. So, I ran some more tests and found a RULE OF THUMB. When encapsulating a PCRE pattern in DOUBLE QUOTES, the following holds true:

$ - Double escaping is required, because PHP will interpret it as the start of a variable if text immediately follows it. Left unescaped, it denotes the end of your needle and breaks the match.

\r\n\t\v - Special escape sequences in PHP; only a single escape is required.

[\^$.|?*+() - Special regex characters; only a single escape is required. A double escape does not seem to break the expression when used unnecessarily.

" - Quotes obviously need to be escaped because of the encapsulation, but you only need to escape them once.

\ - Looking for a backslash? With your expression encapsulated in double quotes, this requires 3 escapes: \\\\ (four backslashes in total).
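The backslash and quote rules from the list above can be sketched as follows. The subject strings and variable names are made up for the example.

```php
<?php
// Subject containing exactly one backslash (single-quoted, so it is kept).
$subject = 'A backslash \ in a string';

// Four backslashes in the source become two in the PHP string,
// which PCRE reads as one escaped, literal backslash.
$pattern = "/\\\\/";
$count = preg_match_all($pattern, $subject);

// A quote only needs the single escape that PHP itself requires.
$quoted = "/a\"ONE\"/";
$quoteHit = preg_match($quoted, 'say a"ONE" now');

echo strlen($pattern), "\n";  // 4
echo $count, "\n";            // 1
echo $quoteHit, "\n";         // 1
```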

Anything I missed?


I'll start by saying that what I write below is not exactly what happens, but I will simplify it for the sake of clarity.

Imagine that two evaluations take place when using regular expressions: the first is done by PHP and the second by PCRE, as if they were separate engines. And, unfortunately for us,

PHP and PCRE EVALUATE THINGS DIFFERENTLY.

We have 3 "guys" here: 1) the USER; 2) PHP; and 3) PCRE.

The USER communicates with PHP by writing CODE, which is what you type into your code editor. PHP then evaluates this CODE and sends another piece of information to PCRE. This piece of information is different from what you typed in your CODE. PCRE then evaluates it and returns something to PHP, which evaluates this response and returns something to the USER.

I'll explain it better with the example below, where I use the backslash ("\") to show what is happening.

Assume this bit of CODE in a PHP file:

 <?php
 $sub = "A backslash \ in a string";
 $pat1 = "#\#";
 $pat2 = "#\\#";
 $pat3 = "#\\\#";
 $pat4 = "#\\\\#";
 echo "sub: ".$sub;
 echo "\n\n";
 echo "pat1: ".$pat1;
 echo "\n";
 echo "pat2: ".$pat2;
 echo "\n";
 echo "pat3: ".$pat3;
 echo "\n";
 echo "pat4: ".$pat4;
 ?>

This will print:

 sub: A backslash \ in a string

 pat1: #\#
 pat2: #\#
 pat3: #\\#
 pat4: #\\#

There is no regular expression evaluation in this example, so only PHP's evaluation of the code takes place. PHP leaves a backslash as is unless it precedes a special character. That is why it prints the backslash in $sub correctly.

PHP evaluates $pat1 and $pat2 to EXACTLY THE SAME string, because in $pat1 the backslash is left as is, and in $pat2 the first backslash escapes the second, resulting in a single backslash.

Now, in $pat3, the first backslash escapes the second, resulting in a single backslash. PHP then evaluates the third backslash and leaves it as is, because it does not precede anything special. The result is a double backslash.

Now someone might say: "But now we have two backslashes again! Shouldn't the first escape the second?" The answer is no. After PHP evaluates the first two backslashes into one, it does not look back; it carries on evaluating what comes next.

By now you already know what happens with $pat4: the first backslash escapes the second and the third escapes the fourth, leaving two at the end.

Now that it is clear what PHP does with these strings, let's add some more code after the previous one.

 if (preg_match($pat1, $sub)) echo "test1: true"; else echo "test1: false";
 echo "\n";
 if (preg_match($pat2, $sub)) echo "test2: true"; else echo "test2: false";
 echo "\n";
 if (preg_match($pat3, $sub)) echo "test3: true"; else echo "test3: false";
 echo "\n";
 if (preg_match($pat4, $sub)) echo "test4: true"; else echo "test4: false";

And the result:

 test1: false
 test2: false
 test3: true
 test4: true

So what happens here is that PHP does not send "what you typed" in the CODE directly to PCRE. Instead, PHP sends what it evaluated beforehand (exactly what we saw above).

For test1 and test2, even though we wrote a different pattern in the CODE for each test, PHP sends the same pattern #\# to PCRE. The same happens for test3 and test4: PHP sends #\\#. That is why the results are the same for test1 and test2, and likewise for test3 and test4.

Now, what happens when PCRE evaluates these patterns? PCRE does not work like PHP.

In test1 and test2, when PCRE sees a single backslash escaping nothing special (or nothing at all), it does not leave it as is. Instead, it probably thinks "What the hell is this?" and returns an error to PHP (actually, I don't really know what happens when a single backslash is sent to PCRE; I searched for it but found nothing conclusive). PHP then takes what I assume is an error, evaluates it as "false" and returns that to the rest of the code (in this example, the if() statement).

In test3 and test4, everything goes as we now expect: PCRE evaluates the first backslash as escaping the second, resulting in a single backslash. This, of course, matches the $sub string and returns a "success message" to PHP, which evaluates it as "true".
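Whether a "false" result is an error or just a failed match can be checked directly, because preg_match() returns false on failure, 0 when nothing matches and 1 on a match. The sketch below is an assumption-laden illustration: as far as I can tell, with # as the delimiter the lone backslash in #\# escapes the closing delimiter, so PHP warns that no ending delimiter was found; the @ operator is used here only to silence that warning.

```php
<?php
$sub = 'A backslash \ in a string';

// "#\\#" is the three-character string #\# : the lone backslash escapes
// the closing delimiter, so the pattern cannot be compiled. The @ operator
// only hides the warning for this demonstration.
$broken = @preg_match("#\\#", $sub);

// "#\\\\#" is the string #\\# : a valid pattern for one literal backslash.
$found = preg_match("#\\\\#", $sub);

// A valid pattern that simply does not occur in the subject.
$absent = preg_match("#xyz#", $sub);

var_dump($broken);  // bool(false)
var_dump($found);   // int(1)
var_dump($absent);  // int(0)
```

Comparing with === false is what separates a broken pattern from a pattern that merely does not match, since both are falsy inside an if().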

ANSWERING THE QUESTION
Some characters are special for PHP (e.g. n for NEW LINE, t for TAB).
Some characters are special for PCRE (e.g. . (dot) to match any character, s to match whitespace).
And some characters are special for both (e.g. $, which for PHP is the start of a variable name and for PCRE asserts the end of the subject).

That is why you only need to escape newlines once, like this: \n. PHP evaluates it as a REAL newline character and sends that to PCRE.

For the dot, if you want to match that specific character, you should use \. and PHP will not evaluate anything, because the dot is not a special character for PHP inside a string. Instead, it sends the sequence as is to PCRE. PCRE then "sees" the backslash before the dot and understands that it must match that specific character. If you use the double escape \\. the first backslash escapes the second, leaving you with the same result.

And if you want to match a dollar sign in a string, you should use \\\$. In PHP, the first backslash escapes the second, leaving a single backslash. Then the third backslash escapes the dollar sign. The result is \$, and that is what PCRE receives. PCRE sees the backslash and understands that the dollar sign is not asserting the end of the subject but is a literal character.

QUOTES

And now we come to the quotes. The issue with them is that PHP evaluates a string differently depending on the quotes used to enclose it. Check it out: Strings

Everything I have said up to this point applies to double quotes. If you try '\n' inside single quotes, PHP will treat the backslash as a literal. But if it is used in a regular expression, PCRE receives the string as is. And since n is also special for PCRE, it interprets the sequence as a newline and BOOM, it "magically" matches a newline in the string. Check the escape sequences here: Escape Sequences
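The newline case can be sketched by comparing the two quoting styles: the strings PHP builds are different, yet both patterns match. The subject string and variable names are hypothetical.

```php
<?php
$subject = "line one\nline two";

// Double quotes: PHP converts \n to a newline before PCRE sees it.
$double = "/\n/";

// Single quotes: PHP keeps the backslash, and PCRE itself reads \n
// as a newline.
$single = '/\n/';

echo strlen($double), "\n";                // 3
echo strlen($single), "\n";                // 4
echo preg_match($double, $subject), "\n";  // 1
echo preg_match($single, $subject), "\n";  // 1
```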

As I said in the beginning, things do not work exactly the way I tried to explain here, but I really hope this helps (and doesn't make it more confusing than it already is).
