Python regex to match a number with a comma separator: why doesn't this work?

I am trying to parse transaction letters from my (German) bank. I would like to extract all the numbers from the following string, which turned out to be more complicated than I thought. Option 2 does almost what I want. Now I want to change it so that it also captures, for example, 80.

My first attempt is option 1, which returns only garbage. Why does it return so many empty strings? It should always have at least the number from the first \d+, no?

Option 3 works (or at least works as expected), so I have somehow answered my own question. I guess I am mostly banging my head over why option 2 doesn't work.

    # -*- coding: utf-8 -*-
    import re

    my_str = """
    Dividendengutschrift für inländische Wertpapiere
    Depotinhaber : ME
    Extag : 18.04.2013 Bruttodividende
    Zahlungstag : 18.04.2013 pro Stück : 0,9800 EUR
    Valuta : 18.04.2013
    Bruttodividende : 78,40 EUR
    *Einbeh. Steuer : 20,67 EUR
    Nettodividende : 78,40 EUR
    Endbetrag : 57,73 EUR
    """

    # Option 1
    print(re.findall(r'\d+(,\d+)?', my_str))
    # Option 2
    print(re.findall(r'\d+,\d+', my_str))
    # Option 3
    print(re.findall(r'[-+]?\d*,\d+|\d+', my_str))

Output:

    ['', '', '', '', '', '', ',98', '', '', '', '', ',40', ',67', ',40', ',73']
    ['0,9800', '78,40', '20,67', '78,40', '57,73']
    ['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73']
+8
python regex
6 answers

Option 1 is the most suitable of the regular expressions, but it does not work as intended because findall returns what the capture group () matched, not the full match.

For example, the first three matches in your example will be 18, 04 and 2013, and in each case the capture group will not match, so an empty string will be added to the list of results.

The solution is to make the group non-capturing:

 r'\d+(?:,\d+)?' 

Option 2 works, except that it will not match sequences that do not contain a comma.

Option 3 is not great because it will also match, for example, +,1.
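
As a rough illustration of the capture-group behaviour described above, here is a small sketch written for Python 3. The sample string is a shortened stand-in for the bank letter, not the original text.

```python
import re

# Shortened stand-in for the bank letter (assumption, for illustration only).
sample = "Extag : 18.04.2013 pro Stück : 0,9800 EUR"

# With a capturing group, findall returns only the group's text, which is
# an empty string whenever the optional group took no part in the match.
capturing = re.findall(r'\d+(,\d+)?', sample)

# With a non-capturing group, findall returns the full match.
non_capturing = re.findall(r'\d+(?:,\d+)?', sample)

print(capturing)      # ['', '', '', ',9800']
print(non_capturing)  # ['18', '04', '2013', '0,9800']
```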

+8

I would like to extract all the numbers from the following string ...

If by "numbers" you mean both the currency amounts and the dates, I think this will do what you want:

    print(re.findall(r'[0-9][0-9,.]+', my_str))

Output:

 ['18.04.2013', '18.04.2013', '0,9800', '18.04.2013', '78,40', '20,67', '78,40', '57,73'] 

If by "numbers" you mean only currency amounts, use

    print(re.findall(r'[0-9]+,[0-9]+', my_str))

Or maybe even better

    print(re.findall(r'[0-9]+,[0-9]+ EUR', my_str))
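
If only the number is wanted, without the trailing EUR, one option (a sketch, not part of the original answer) is to wrap the numeric part in a capture group, since findall then returns just the group. The sample string below is a shortened stand-in for the letter, and the code is written for Python 3.

```python
import re

# Shortened stand-in for the bank letter (assumption, for illustration only).
sample = "pro Stück : 0,9800 EUR Valuta : 18.04.2013 Bruttodividende : 78,40 EUR"

# " EUR" must follow the number, but only the parenthesised part is returned.
amounts = re.findall(r'([0-9]+,[0-9]+) EUR', sample)

print(amounts)  # ['0,9800', '78,40']
```
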
+3

The ?: (marked below) is what matters; the following

    print(re.findall(r'\d+(?:,\d+)?', my_str))
                           ^^

Outputs

 ['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73'] 

Eliminating dotted numbers is a bit trickier:

    print(re.findall(r'(?<!\d\.)\b\d+(?:,\d+)?\b(?!\.\d)', my_str))
                       ^^^^^^^^^^^            ^^^^^^^^^^

Outputs

 ['0,9800', '78,40', '20,67', '78,40', '57,73'] 
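
The same pattern can be spelled out with re.VERBOSE so that each lookaround carries its own comment. This is only a sketch for Python 3: the name AMOUNT and the sample string, including the bare 80, are made up for illustration.

```python
import re

# Same pattern as above, annotated piece by piece.
AMOUNT = re.compile(r"""
    (?<!\d\.)     # not preceded by a digit and a dot (rejects the 04 in 18.04)
    \b\d+         # integer part, starting at a word boundary
    (?:,\d+)?     # optional decimal part after a comma
    \b            # the number must end at a word boundary
    (?!\.\d)      # not followed by a dot and a digit (rejects the 18 in 18.04)
""", re.VERBOSE)

# Made-up sample (assumption): a date, an amount and a plain integer.
sample = "Valuta : 18.04.2013 Bruttodividende : 78,40 EUR Stück : 80"

print(AMOUNT.findall(sample))  # ['78,40', '80']
```

Because the decimal part is optional, a plain integer such as 80 is matched as well, while the pieces of a date are rejected by the lookarounds.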
+1

Here is a solution that parses the statement and puts the result in a dictionary called bank_statement:

    # -*- coding: utf-8 -*-
    my_str = """
    Dividendengutschrift für inländische Wertpapiere
    Depotinhaber : ME
    Extag : 18.04.2013 Bruttodividende
    Zahlungstag : 18.04.2013 pro Stück : 0,9800 EUR
    Valuta : 18.04.2013
    Bruttodividende : 78,40 EUR
    *Einbeh. Steuer : 20,67 EUR
    Nettodividende : 78,40 EUR
    Endbetrag : 57,73 EUR
    """

    bank_statement = {}
    for line in my_str.split('\n'):
        tokens = line.split()
        it = iter(tokens)
        category = ''
        for token in it:
            if token == ':':
                # Everything collected before the colon is the category,
                # the token right after the colon is its value.
                category = category.strip(' *')
                bank_statement[category] = next(it)
                category = ''
            else:
                category += ' ' + token

    # bank_statement now has all the values
    print('\n'.join('{0:.<18} {1}'.format(k, v)
                    for k, v in sorted(bank_statement.items())))

The output of this code is:

    Bruttodividende... 78,40
    Depotinhaber...... ME
    Einbeh. Steuer.... 20,67
    Endbetrag......... 57,73
    Extag............. 18.04.2013
    Nettodividende.... 78,40
    Valuta............ 18.04.2013
    Zahlungstag....... 18.04.2013
    pro Stück........ 0,9800

Discussion

  • The code scans the text line by line.
  • It then breaks each line into tokens.
  • It scans through the tokens looking for a colon. If one is found, the part before the colon is used as the category and the token after it as the value. bank_statement['Extag'], for example, has a value of '18.04.2013'.
  • Note that all values are strings, not numbers, but converting them is straightforward.
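
The conversion mentioned in the last point could look roughly like this. It is a sketch for Python 3 under stated assumptions: parse_value is a hypothetical helper, not part of the answer, and it assumes dates look like 18.04.2013 and amounts use a decimal comma with no thousands separator.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def parse_value(text):
    """Hypothetical helper: convert a date or amount string, else return it unchanged."""
    try:
        # Dates in the letter look like 18.04.2013.
        return datetime.strptime(text, '%d.%m.%Y').date()
    except ValueError:
        pass
    try:
        # Amounts use a decimal comma, e.g. 57,73.
        return Decimal(text.replace(',', '.'))
    except InvalidOperation:
        # Anything else (e.g. the account holder) stays a string.
        return text


# A few entries as produced by the parsing code above.
bank_statement = {'Extag': '18.04.2013', 'Endbetrag': '57,73', 'Depotinhaber': 'ME'}
converted = {key: parse_value(value) for key, value in bank_statement.items()}

print(converted['Extag'])      # 2013-04-18
print(converted['Endbetrag'])  # 57.73
```

Decimal is used here rather than float so that currency amounts keep their exact digits. A value with a thousands separator, such as 1.234,56, would be left as a string by this sketch.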
+1

Try the following:

 re.findall(r'\d+(?:[\d,.]*\d)', my_str) 

This regular expression requires that the match start with a digit, continue with any combination of digits, commas and periods, and also end with a digit.
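
A quick check of that pattern, as a sketch for Python 3. The sample string is made up for illustration, and the label Anzahl with its single-digit value is not from the original letter.

```python
import re

pattern = r'\d+(?:[\d,.]*\d)'

# Made-up sample (assumption): a date, an amount and a single-digit number.
sample = "Valuta : 18.04.2013 Bruttodividende : 78,40 EUR Anzahl : 8"

matches = re.findall(pattern, sample)
print(matches)  # ['18.04.2013', '78,40']
```

Because the pattern needs one digit to start and another to end, a number consisting of a single digit is not matched.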

0

Option 2 does not match numbers such as '18.04.2013' because you are matching '\d+,\d+', which means

digit (one or more), comma, digit (one or more)

To parse the numbers in your case, I would use

 \s(\d+[^\s]+) 

which translates to

    space (get digit [one or more] get everything != space)

    space = \s
    get digit = \d
    one or more = + (so it becomes \d+)
    get everything != space = [^\s]
    one or more = + (so it becomes [^\s]+)
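
Tried on a shortened sample, the pattern behaves as follows. This is a sketch for Python 3, and the sample string with its trailing Ref : 12abc is made up for illustration.

```python
import re

# Made-up sample (assumption): a date, an amount and a token that merely
# starts with digits.
sample = "Extag : 18.04.2013 pro Stück : 0,9800 EUR Ref : 12abc"

# findall returns the capture group, i.e. the token without the leading space.
tokens = re.findall(r'\s(\d+[^\s]+)', sample)

print(tokens)  # ['18.04.2013', '0,9800', '12abc']
```

Note that the pattern accepts any whitespace-delimited token that starts with a digit and is at least two characters long, so it also picks up tokens such as 12abc.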
0
