Confused by backslash in regular expressions

Question

I am confused by backslashes in regular expressions. Within a regular expression, a \ has a special meaning; for example, \d means a decimal digit. If you add a backslash in front of the backslash, this special meaning is lost. In the regex-howto you can read:

Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It's also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.

So print(re.search('\d', '\d')) gives None, because \d matches any decimal digit but there is none in \d.

Now I expect print(re.search('\\d', '\d')) to match \d, but the answer is still None.

Only print(re.search('\\\d', '\d')) gives <_sre.SRE_Match object; span=(0, 2), match='\\d'>.

Does anyone have an explanation?

+18

python regex

tobmei05 Nov 07 '15 at 11:21


3 answers

The r before the regular expression in the call to search() specifies that the regular expression is a raw string. This allows backslashes to be used in the regular expression as ordinary characters rather than as part of an escape sequence. Let me explain ...

Before the re module's search method processes the strings passed to it, the Python interpreter makes an initial pass over each string. If there are backslashes in a string, the Python interpreter must decide whether each one is part of a Python escape sequence (e.g. \n or \t) or not.

Note: at this point, Python does not care whether '\' is a regular expression metacharacter.

If the '\' is followed by a recognized Python escape character (t, n, etc.), then the backslash and the escape character are replaced with the actual Unicode or 8-bit character. For example, '\t' is replaced with the ASCII character for tab. Otherwise the backslash is passed through and interpreted as a '\' character.

Consider the following.

    >>> s = '\t'
    >>> print("[" + s + "]")
    [       ]    # an actual tab character after preprocessing
    >>> s = '\d'
    >>> print("[" + s + "]")
    [\d]         # '\d' after preprocessing

Sometimes we want to include a character sequence containing '\' in a string without Python interpreting it as an escape sequence. To do this, we escape the '\' with another '\'. Now when Python sees '\\', it replaces the two backslashes with a single '\' character.

    >>> s = '\\t'
    >>> print("[" + s + "]")
    [\t]         # '\t' after preprocessing
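One way to check what the interpreter has done to a literal is to look at its length and its individual characters. The snippet below is only a minimal sketch; the variable names are made up for illustration.

```python
# What Python stores after processing the escape sequences in a literal.
tab = '\t'          # recognized escape: a single tab character
escaped = '\\t'     # escaped backslash: a backslash followed by the letter t

print(len(tab))       # 1
print(len(escaped))   # 2
print(list(escaped))  # ['\\', 't']  (repr shows the single backslash doubled)
```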

After the Python interpreter has made its pass over both strings, they are passed to the re module's search method. The search method parses the regular expression string to identify the regular expression's metacharacters.

Now '\' is also a special regular expression metacharacter, and it is interpreted as one UNLESS it is escaped at the time the re search() method runs.

Consider the following call.

    >>> match = re.search('a\\t', 'a\\t')    # match is None

There is no match. Why? Let's look at the strings after the Python interpreter has made its pass.

    String 1: 'a\t'
    String 2: 'a\t'

So why is match equal to None? When search() interprets String 1, the backslash is treated as a metacharacter rather than an ordinary character, because String 1 is a regular expression. The backslash in String 2, however, is not in a regular expression and has already been processed by the Python interpreter, so it is treated as an ordinary character.

So the search() method is looking for 'a' followed by escape-t (a tab) in the string 'a\t', and that does not match.
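The failing call can be checked directly. As a sketch (the variable names are illustrative only), the same pattern that finds nothing in 'a\\t' does match once the subject really contains a tab character:

```python
import re

pattern = 'a\\t'    # re receives a\t: the letter a, then the regex escape for a tab

no_match = re.search(pattern, 'a\\t')    # subject is a, backslash, t
tab_match = re.search(pattern, 'a\tb')   # subject is a, tab, b

print(no_match)            # None
print(tab_match.span())    # (0, 2)
```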

To fix this, we can tell the search() method not to interpret the '\' as a metacharacter. We do this by escaping it.

Consider the following call.

    >>> match = re.search('a\\\\t', 'a\\t')    # match contains 'a\t'

Again, let's look at the strings after the Python interpreter has made its pass.

    String 1: 'a\\t'
    String 2: 'a\t'

Now when the search() method processes the regular expression, it sees that the second backslash is escaped by the first and should not be treated as a metacharacter. It therefore interprets the pattern as the literal text 'a\t', which matches String 2.
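When the text to be matched literally is already held in a string, the re module's re.escape function can add the regex-level backslashes for you. The answer does not mention it; this is offered only as a related sketch, with illustrative variable names.

```python
import re

subject = 'a\\t'                # the three characters a, backslash, t
pattern = re.escape(subject)    # the backslash is escaped for the regex engine

print(pattern == 'a\\\\t')                              # True
print(re.search(pattern, subject).group() == subject)   # True
```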

An alternative way to have search() treat '\' as an ordinary character is to place an r before the regular expression. This tells the Python interpreter NOT to preprocess the string.

Consider this.

    >>> match = re.search(r'a\\t', 'a\\t')    # match contains 'a\t'

Here the Python interpreter does not modify the first string, but it does process the second. The strings passed to search() are:

    String 1: 'a\\t'
    String 2: 'a\t'

As in the previous example, search() interprets the '\\' as the single character '\' and not as a metacharacter, so the pattern matches String 2.
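The raw and non-raw spellings of the pattern are the same string by the time re sees them. A minimal sketch, with variable names chosen only for illustration:

```python
import re

raw_pattern = r'a\\t'       # raw literal: both backslashes are kept as written
plain_pattern = 'a\\\\t'    # ordinary literal: each pair collapses to one backslash

print(raw_pattern == plain_pattern)    # True

match = re.search(raw_pattern, 'a\\t')
print(match.group())                   # a\t
```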

+8

eric.mcgregor Apr 21 '16 at 20:56


Python's own string parsing is (partially) getting in your way.

If you want to see what re sees, type

    print('\d')
    print('\\d')
    print('\\\d')

at the Python command line. You will see that \d and \\d both result in \d, the latter having been processed by Python's string parser, while \\\d results in \\d.

If you want to avoid any trouble with this, use raw strings as suggested by the re module documentation: r'\\d' will result in \\d being seen by the re module.
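Applied to the calls from the question, raw strings make it visible which pattern re actually receives. This is a sketch under that assumption; the variable names are illustrative.

```python
import re

subject = r'\d'    # two characters: a backslash and the letter d

digit_class = re.search(r'\d', subject)   # pattern \d: any decimal digit
literal = re.search(r'\\d', subject)      # pattern \\d: a literal backslash, then d

print(digit_class)       # None
print(literal.span())    # (0, 2)
```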

+4

glglgl Nov 07 '15 at 11:28


Tom karzes · Accepted Answer · 2015-11-07T11:54:45+0000

The confusion stems from the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, and so on. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ is not a recognized escape character, then the \ is treated like any other character and passed through, but I don't recommend depending on this. Instead, always escape your \ characters by doubling them, i.e. \\.

If you want to see how Python is expanding your strings, just print the string. For example:

    s = 'a\\b\tc'
    print(s)

If s is part of an aggregate data type, such as a list or a tuple, and you print that aggregate, Python will enclose the string in single quotes and include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also be displayed in quotes with \ escapes.
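The difference between printing a string and printing its quoted form can be seen with repr(). A small sketch using the same s as above:

```python
s = 'a\\b\tc'     # five characters: a, backslash, b, tab, c

print(len(s))     # 5
print(repr(s))    # 'a\\b\tc'  (quoted, with escapes)
print([s])        # ['a\\b\tc']  (a list shows the repr of each item)
```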

Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up containing \\, and the re module will treat this as a single literal \ character.

An alternative way to include \ characters in Python strings is to use raw strings; for example, r'a\b' is equivalent to "a\\b".
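The two levels can be counted out in code. The following is a minimal sketch (the name four is just illustrative) showing that four backslashes in an ordinary literal become two in the string and one literal backslash for re:

```python
import re

four = '\\\\'    # Python stores two backslashes; re reads them as one literal backslash

print(len(four))           # 2
print(four == r'\\')       # True
print(r'a\b' == 'a\\b')    # True

match = re.search(four, 'a\\b')
print(match.span())        # (1, 2)
```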
