I have some source data:
    Dear John Buy 1 of Coke, cost 10 dollars
    Ivan Buy 20 of Milk
    Dear Tina Buy 10 of Coke, cost 100 dollars
    Mary Buy 5 of Milk
Data rules:
Not every line starts with "Dear"; if a line does, it ends with a cost.
A field is not always a plain word; it can be written without restrictions (letters, digits, other characters, etc.).
I want to group this information into fields, and I tried using a regex. This is what I have so far:
    import re

    # `file` is an already-open text file object
    for line in file:
        match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>\w+)(?:\D+)(?P<costs>\d*)', line)
        if match is not None:
            print(match.groups())
    file.close()
Now the output is as follows:
    ('John', '1', 'Coke', '10')
    ('Ivan', '20', 'Milk', '')
    ('Tina', '10', 'Coke', '100')
    ('Mary', '5', 'Milk', '')
The output above is what I want. However, if the item is replaced with a less regular string, for example A1~A10, some of the tuples contain incorrect information:
    ('Ivan', '20', 'A1', '10')
    ('Mary', '5', 'A1', '10')
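A minimal reproduction of the wrong result on a single line may make the cause easier to see. The leading space in `line` is an assumption, added because the pattern begins with `\s+`; the variable names are only for illustration.

```python
import re

# Same pattern as in the loop above.
pattern = re.compile(
    r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>\w+)(?:\D+)(?P<costs>\d*)'
)

# Assumed input: a line without "Dear", with a leading space, item "A1~A10".
line = " Ivan Buy 20 of A1~A10\n"

result = pattern.search(line).groups()
print(result)  # ('Ivan', '20', 'A1', '10')

# \w+ stops at "~", so item is only "A1".
# \D+ then consumes "~A", and \d* picks up the trailing "10" as the cost.
```

So the item is cut at the first non-word character, and the rest of the item leaks into the costs group.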
I think the one constant in the item field is that it always ends with a comma (when a cost follows). But I don't know how to take advantage of that.
The code above seemed to work at first. I then thought that (?P<item>\w+) should be replaced with (?P<item>.+). If I do this, the tuple comes out wrong:
('John', '1', 'Coke, cost 10 dollars', '')
How can I read data in the format I want using a regex in Python?
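One possible direction, given only as a sketch: match the item with a negated character class that stops at the first comma, and make the cost part optional. This assumes the item never contains a comma and that the cost, when present, is the first run of digits after that comma. The names `PATTERN` and `parse_line` are hypothetical.

```python
import re

# Sketch only. Assumptions: the item contains no comma, and an optional
# cost is the first run of digits after the comma that ends the item.
PATTERN = re.compile(
    r'\s*(?:Dear\s+)?(?P<name>\w+)\D*(?P<num>\d+)\sof\s'
    r'(?P<item>[^,\n]+)(?:,\D*(?P<costs>\d+))?'
)

def parse_line(line):
    """Return (name, num, item, costs) or None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return (
        match.group('name'),
        match.group('num'),
        match.group('item').strip(),
        match.group('costs') or '',  # empty string when there is no cost
    )

sample = [
    "Dear John Buy 1 of Coke, cost 10 dollars\n",
    "Ivan Buy 20 of A1~A10\n",
]
for line in sample:
    print(parse_line(line))
# ('John', '1', 'Coke', '10')
# ('Ivan', '20', 'A1~A10', '')
```

Because `[^,\n]+` cannot cross a comma, the item keeps characters such as `~` but never swallows the cost part, which is what goes wrong with `.+`.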