Grouping data using regex in Python

I have some source data:

 Dear John Buy 1 of Coke, cost 10 dollars
  Ivan Buy 20 of Milk
 Dear Tina Buy 10 of Coke, cost 100 dollars
  Mary Buy 5 of Milk

Data rules:

  • Not every line starts with Dear; when a line does, it ends with a cost.

  • The item is not always a plain word; it can be written without restrictions (letters, digits, other characters).

I want to group information, and I tried using a regex. This is what I tried before:

 for line in file.readlines():
     match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>\w+)(?:\D+)(?P<costs>\d*)', line)
     if match is not None:
         print(match.groups())
 file.close()

Now the output is as follows:

 ('John', '1', 'Coke', '10')
 ('Ivan', '20', 'Milk', '')
 ('Tina', '10', 'Coke', '100')
 ('Mary', '5', 'Milk', '')

The output above is what I want. However, if the item is replaced with an unusual string, for example A1~A10, some of the results come out wrong:

 ('Ivan', '20', 'A1', '10')
 ('Mary', '5', 'A1', '10')

I think the one constant in the item field is that it always ends with a comma (when a cost follows). But I don't know how to take advantage of that.

The code above seemed to work at first. I then thought that (?P<item>\w+) should be replaced with (?P<item>.+). If I do this, the tuple comes out wrong:

 ('John', '1', 'Coke, cost 10 dollars', '') 

How can I read data in the format I want using a regex in Python?

python regex
4 answers

I tried this regex

^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?

Explanation

  • ^(Dear)? matches Dear at the start of the string, if it is present
  • (?P<name>\w*) named capture group for the name
  • \D* matches zero or more non-digit characters
  • (?P<num>\d+) named capture group for num
  • \sof\s matches the word of with whitespace on both sides
  • (?P<drink>\w*) named capture group for the drink
  • (,\D*(?P<cost>\d+)\D*)? optional group that captures the cost of the drink
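
The breakdown above can also be kept next to the pattern itself. A minimal sketch, assuming Python's re.VERBOSE flag (which ignores whitespace in the pattern and allows # comments); the names pattern and sample are only illustrative:

```python
import re

# Same pattern as above, split over several lines with re.VERBOSE.
pattern = re.compile(r"""
    ^(Dear)?                  # optional leading 'Dear'
    \s*(?P<name>\w*)          # the name
    \D*                       # any run of non-digits (e.g. ' Buy ')
    (?P<num>\d+)              # the quantity
    \sof\s                    # the word 'of' with whitespace around it
    (?P<drink>\w*)            # the drink
    (,\D*(?P<cost>\d+)\D*)?   # optional ', cost N dollars' part
""", re.VERBOSE)

sample = 'Dear John Buy 1 of Coke, cost 10 dollars'
match = pattern.search(sample)
print(match.groupdict())
```

groupdict() returns only the named groups, so the unnamed (Dear) and outer cost groups do not show up in the result.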

Usage

 >>> import re
 >>> reobject = re.compile(r'^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?')

First piece of data

 >>> data1 = 'Dear John Buy 1 of Coke, cost 10 dollars'
 >>> match_object = reobject.search(data1)
 >>> print(match_object.group('name', 'num', 'drink', 'cost'))
 ('John', '1', 'Coke', '10')

Second piece of data

 >>> data2 = ' Ivan Buy 20 of Milk'
 >>> match_object = reobject.search(data2)
 >>> print(match_object.group('name', 'num', 'drink', 'cost'))
 ('Ivan', '20', 'Milk', None)

I would use this regex:

 r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?' 

Demo

 >>> line = 'Dear Tina Buy 10 of A1~A10'
 >>> match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?', line)
 >>> match.groups()
 ('Tina', '10', 'A1~A10', None)
 >>> line = 'Dear Tina Buy 10 of A1~A10, cost 100 dollars'
 >>> match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?', line)
 >>> match.groups()
 ('Tina', '10', 'A1~A10', '100')

Explanation

The first part of your regex is fine; this is the hard part:

(?P<item>[^,]+) Since we know the line contains a comma whenever the cost part is present, we take everything up to (but not including) the comma as the value of item.

(?:,\D+)?(?P<costs>\d+)? Here we use two groups. The important thing is the ? after the closing parenthesis of each group. From the Python documentation:

'?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.

So we use ? to handle both cases (with or without a cost part).

(?:,\D+) is a non-capturing group that matches a comma followed by one or more non-digit characters.

(?P<costs>\d+) captures the digits that follow into the named group costs.
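
Putting the pieces together, here is a minimal sketch that applies this pattern to sample lines like those in the question. The helper name parse_line is made up for illustration, and stripping trailing whitespace first is my own assumption: without it, [^,]+ would also swallow the trailing newline when lines are read from a file.

```python
import re

PATTERN = re.compile(
    r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?'
)

def parse_line(line):
    """Return (name, num, item, costs), or None if the line does not match."""
    # Remove only trailing whitespace: the pattern relies on the whitespace
    # that precedes the name, so leading spaces must stay.
    match = PATTERN.search(line.rstrip())
    return match.groups() if match else None

lines = [
    'Dear John Buy 1 of Coke, cost 10 dollars\n',
    ' Ivan Buy 20 of A1~A10\n',
    'Dear Tina Buy 10 of A1~A10, cost 100 dollars\n',
    ' Mary Buy 5 of Milk\n',
]
rows = [parse_line(line) for line in lines]
for row in rows:
    print(row)
```

Lines without a cost part give None in the last position rather than an empty string, because the costs group is optional and simply does not take part in the match.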


Without regex:

 with open('commandes.txt') as f:
     results = []
     for line in f:
         parts = line.split(None, 5)
         price = ''
         if parts[0] == 'Dear':
             tmp = parts[5].split(',', 1)
             for tok in tmp[1].split():
                 if tok.isnumeric():
                     price = tok
                     break
             results.append((parts[1], parts[3], tmp[0], price))
         else:
             results.append((parts[0], parts[2], parts[4].split(',')[0], price))
 print(results)

The fields before the product name may contain any characters except whitespace, so each line is split on whitespace at most 5 times. When the line starts with "Dear", the last part is split at the first comma to extract the product name and the price. Note that if the price is always in the same position (that is, right after "cost"), you can drop the inner loop and use price = tmp[1].split()[1] instead.

Note: if you want to skip empty lines, you can change the outer for loop to:

 for line in (x for x in f if x.rstrip()): 
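
As a rough sketch of the simplification mentioned above, the version below assumes the price always follows the word "cost" and works on an in-memory list instead of commandes.txt so that it can be run as is. The function name parse_orders and the sample list are illustrative, not part of the original code.

```python
def parse_orders(lines):
    """Split each order line on whitespace and return (name, num, item, price) tuples."""
    results = []
    # Skip blank lines with the same kind of generator expression as above.
    for line in (x for x in lines if x.rstrip()):
        parts = line.split(None, 5)
        if parts[0] == 'Dear':
            # parts[5] looks like 'Coke, cost 10 dollars'
            item, rest = parts[5].split(',', 1)
            price = rest.split()[1]  # assumes the token after 'cost' is the price
            results.append((parts[1], parts[3], item, price))
        else:
            results.append((parts[0], parts[2], parts[4].split(',')[0], ''))
    return results

orders = [
    'Dear John Buy 1 of Coke, cost 10 dollars\n',
    ' Ivan Buy 20 of Milk\n',
    '\n',
    'Dear Tina Buy 10 of A1~A10, cost 100 dollars\n',
]
print(parse_orders(orders))
```

Like the original, this sketch expects every "Dear" line to contain a comma; a "Dear" line without one would raise an error when unpacking item and rest.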

If you use .+, the subpattern will grab the rest of the line, because . matches any character except a newline (unless the re.S flag is set).

You can replace \w+ with a negated character class subpattern, [^,]+, to match one or more characters other than a comma:

 r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)\D*(?P<costs>\d*)'
                                                 ^^^^^
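
To see the difference in isolation, here is a small sketch that compares the two item subpatterns on one line from the question. Only the part after "of" is matched, to keep the example short; the variable names are illustrative.

```python
import re

line = 'Dear John Buy 1 of Coke, cost 10 dollars'

# .+ is greedy and runs to the end of the line.
greedy = re.search(r'\sof\s(?P<item>.+)', line)
# [^,]+ stops just before the first comma.
bounded = re.search(r'\sof\s(?P<item>[^,]+)', line)

print(greedy.group('item'))   # Coke, cost 10 dollars
print(bounded.group('item'))  # Coke
```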

See the IDEONE demo:

 import re
 file = "Dear John Buy 1 of A1~A10, cost 10 dollars\n Ivan Buy 20 of Milk\nDear Tina Buy 10 of Coke, cost 100 dollars\n Mary Buy 5 of Milk"
 for line in file.split("\n"):
     match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)\D*(?P<costs>\d*)', line)
     if match:
         print(match.groups())

Output:

 ('John', '1', 'A1~A10', '10')
 ('Ivan', '20', 'Milk', '')
 ('Tina', '10', 'Coke', '100')
 ('Mary', '5', 'Milk', '')