re.findall is acting weird

Source string:

    # Python 3.4.3
    s = r'abc123d, hello 3.1415926, this is my book'

and here is my pattern:

    pattern = r'-?[0-9]+(\\.[0-9]*)?|-?\\.[0-9]+'

re.search seems to give me the correct result:

    m = re.search(pattern, s)
    print(m)
    # output: <_sre.SRE_Match object; span=(3, 6), match='123'>

but re.findall just returns a list of empty strings:

    L = re.findall(pattern, s)
    print(L)
    # output: ['', '', '']

Why can't re.findall give me the expected list?

 ['123', '3.1415926'] 
+17
python regex
3 answers
    import re

    s = r'abc123d, hello 3.1415926, this is my book'
    # Non-capturing group (?:...) and a single backslash before the dot
    print(re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s))

You do not need to double the backslash when you use a raw string.

Output: ['123', '3.1415926']

Also, the return type will be a list of strings. If you want integers and floats instead, use map:

    import ast
    import re

    s = r'abc123d, hello 3.1415926, this is my book'
    # In Python 3, map returns an iterator, so wrap it in list() before printing
    print(list(map(ast.literal_eval, re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s))))

Output: [123, 3.1415926]
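As a possible alternative to ast.literal_eval, the conversion can be sketched with plain int() and float(). The helper name to_number is hypothetical and not part of the answer above. One reason to consider it: in Python 3, ast.literal_eval rejects integers with leading zeros such as '007', which this regex can match, while int() accepts them.

```python
import re

def to_number(text):
    # Hypothetical helper: float if the token has a decimal point, int otherwise.
    return float(text) if '.' in text else int(text)

s = r'abc123d, hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+'

numbers = [to_number(token) for token in re.findall(pattern, s)]
print(numbers)  # [123, 3.1415926]
```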

+7

Two things to note here:

  • re.findall returns captured texts if the regular expression pattern contains capture groups
  • the r'\\.' part of your pattern matches two consecutive characters: a literal \ followed by any character except a newline.

See the documentation for re.findall:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
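The rule quoted above can be seen with a small sketch. The sample string and the three patterns are made up for illustration; only the number of capture groups changes between them.

```python
import re

s = 'x=1.5, y=-2.25'

# No groups: each item is the whole match.
no_groups = re.findall(r'-?\d+\.\d+', s)
# One group: each item is only that group's text.
one_group = re.findall(r'-?\d+(\.\d+)', s)
# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(-?\d+)(\.\d+)', s)

print(no_groups)   # ['1.5', '-2.25']
print(one_group)   # ['.5', '.25']
print(two_groups)  # [('1', '.5'), ('-2', '.25')]
```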

Note that to make re.findall return only the matched values, you can usually

  • remove redundant capture groups (e.g., (a(b)c) -> abc)
  • convert all capture groups to non-capturing ones (i.e. replace ( with (?:), provided there are no backreferences in the pattern that refer to the group values (otherwise, see the next item)
  • use re.finditer instead ([x.group() for x in re.finditer(pattern, s)])
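The re.finditer option from the list above can be sketched as follows, assuming the capture group is kept in the pattern on purpose. Each match object still gives access to both the whole match and the group.

```python
import re

s = r'abc123d, hello 3.1415926, this is my book'
# The capture group is deliberately left in place here.
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'

whole = [m.group() for m in re.finditer(pattern, s)]
fractions = [m.group(1) for m in re.finditer(pattern, s)]

print(whole)      # ['123', '3.1415926']
print(fractions)  # [None, '.1415926']
```

Unlike findall, which reports a group that did not take part in the match as an empty string, group(1) returns None in that case.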

In your case, findall returned all the captured texts, and they were empty because you have \\ inside an r'' string literal, which tries to match a literal \ .
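A quick way to check the effect of the doubled backslash is to try both spellings on a string that contains a real backslash. The sample string below is made up for illustration.

```python
import re

raw = r'\\.'
print(len(raw))  # 3: two backslashes and a dot

sample = r'3.14 and C:\temp'

# r'\\.' matches a literal backslash followed by any character.
print(re.findall(r'\\.', sample))  # ['\\t']
# r'\.' matches a literal dot.
print(re.findall(r'\.', sample))   # ['.']
```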

To match the numbers you need to use

 -?\d*\.?\d+ 

Pattern explanation:

  • -? - Optional minus sign
  • \d* - Zero or more digits
  • \.? - Optional decimal separator
  • \d+ - One or more digits

Here is the IDEONE demo:

    import re

    s = r'abc123d, hello 3.1415926, this is my book'
    pattern = r'-?\d*\.?\d+'
    L = re.findall(pattern, s)
    print(L)
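The shorter pattern is not fully equivalent to the one in the question. A minimal comparison on a few made-up edge cases (the variable names short and original are only for this sketch) shows where the two differ:

```python
import re

short = r'-?\d*\.?\d+'
original = r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+'

results = {}
for text in ['-.5', '5.', '1.2.3']:
    results[text] = (re.findall(short, text), re.findall(original, text))
    print(text, results[text])

# -.5    (['-.5'], ['-.5'])
# 5.     (['5'], ['5.'])
# 1.2.3  (['1.2', '.3'], ['1.2', '.3'])
```

The only difference in these cases is the trailing dot: the original pattern keeps it in '5.', while the shorter one requires a digit at the end of every match.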
+16

Just to explain why search seemed to return what you wanted but findall didn't:

search returns an SRE_Match object that carries some information, such as:

  • string : the string passed to the search function.
  • re : the regex object used in the search function.
  • groups() : a tuple of the strings captured by the capture groups inside the regex.
  • group(index) : retrieves the string captured by the group with that index (index > 0).
  • group(0) : returns the string matched by the whole regex.

search stops at the first match it finds, builds an SRE_Match object, and returns it. Check this code:

    import re

    s = r'abc123d'
    pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
    m = re.search(pattern, s)
    print(m.string)    # 'abc123d'
    print(m.group(0))  # the regex matched '123'
    # There is only one group in the regex, (\.[0-9]*), and it did not
    # take part in this match, which is why groups() returns (None,)
    print(m.groups())

    s = ', hello 3.1415926, this is my book'
    m2 = re.search(pattern, s)
    print(m2.string)    # ', hello 3.1415926, this is my book'
    print(m2.group(0))  # the regex matched '3.1415926'
    print(m2.groups())  # the group captured this part: ('.1415926',)

findall behaves differently because it does not stop at the first match; it keeps scanning to the end of the text. However, if the regex contains at least one capture group, findall does not return the matched strings but the strings captured by the capture groups:

    import re

    s = r'abc123d , hello 3.1415926, this is my book'
    pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
    m = re.findall(pattern, s)
    print(m)  # ['', '.1415926']

The first element comes from the first match, '123', where the capture group captured nothing and so is reported as ''. The second element comes from the second match, '3.1415926', where the capture group matched the '.1415926' part.

If you want findall to return the matched strings, you must turn all capture groups () in your regex into non-capturing groups (?:) :

    import re

    s = r'abc123d , hello 3.1415926, this is my book'
    pattern = r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+'
    m = re.findall(pattern, s)
    print(m)  # ['123', '3.1415926']
0