I am trying to grab substrings from a string that looks like
'some string, another string, '
I want the resulting match groups to be
('some string', 'another string')
My current solution:
>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')
It works, but it is not practical: what I am showing here is, of course, massively reduced in complexity compared to what I do in the real project. I want to use only one "straight" (non-computed) regex pattern. Unfortunately, my attempts have failed so far:
This does not match (the result is None), because {2} applies only to the space, not to the whole pattern:
>>> match('.*?, {2}', 'some string, another string, ')
Adding parentheses around the repeated pattern includes the comma and the space in the result:
>>> match('(.*?, ){2}', 'some string, another string, ').groups()
('another string, ',)
Adding another set of parentheses fixes that, but gives me too much:
>>> match('((.*?), ){2}', 'some string, another string, ').groups()
('another string, ', 'another string')
Adding a non-capturing modifier improves the result, but still misses the first string:
>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)
I feel like I'm close, but I really can't find the proper way.
Can anyone help me? Any other approaches that I don't see?
Update after the first few answers:
First of all, thank you all, your help is greatly appreciated! :-)
As I said in the original post, I left out a lot of complexity in my question in order to show the actual underlying problem. For starters, in the project I am working on, I process a large number of files (currently tens of thousands per day) in a number (currently 5, soon ~25, possibly hundreds later) of different line-based formats. There are also XML, JSON, binary and some other data file formats, but let's stay focused.
To handle the many file formats and exploit the fact that many of them are line-based, I created a somewhat generic Python module that loads one file after another, applies a regular expression to each line, and returns a large data structure with the matches. This module is a prototype; the production version will need to be written in C++ for performance reasons. It will be connected via Boost::Python and will probably add the subject of regex dialects to the list of complexities.
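To make the setup concrete, here is a minimal sketch of what such a line-based module might look like. The function name `parse_lines` and the shape of the returned data are assumptions made for illustration only; the real module is more involved.

```python
import re

def parse_lines(lines, pattern):
    """Apply `pattern` to every line and collect the captured groups.

    `lines` can be any iterable of strings, e.g. an open text file.
    Hypothetical helper, not the actual project code.
    """
    regex = re.compile(pattern)
    results = []
    for number, line in enumerate(lines, start=1):
        m = regex.match(line)
        # Lines that do not match are recorded as None instead of being dropped.
        results.append((number, m.groups() if m else None))
    return results

sample = ["some string, another string, ", "not a record"]
print(parse_lines(sample, r"(.*?), (.*?), "))
```

Recording unmatched lines as None is just one possible choice; raising an error or skipping them would work equally well for the purpose of this question.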
In addition, there are not 2 repetitions, but a number varying between zero and 70 (or so), the comma is not always a comma, and despite what I said originally, some parts of the regex pattern have to be computed at runtime; let's just say I have reasons to try to reduce the "dynamic" amount and have as much "fixed" pattern as possible.
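For reference, the computed approach from my current solution generalizes roughly as follows. This is only a sketch of the workaround I would like to avoid: `build_pattern` and `split_fields` are hypothetical names, and it assumes the repetition count and the separator are known before matching.

```python
import re

def build_pattern(count, separator=", "):
    """Return a regex with `count` lazy capture groups, each followed by `separator`."""
    # re.escape keeps separator characters such as '|' or '.' literal.
    return count * ("(.*?)" + re.escape(separator))

def split_fields(line, count, separator=", "):
    """Return the captured fields as a tuple, or None if the line does not match."""
    m = re.match(build_pattern(count, separator), line)
    return m.groups() if m else None

print(split_fields("some string, another string, ", 2))
print(split_fields("a|b|c|", 3, separator="|"))
```

The drawback is exactly the "dynamic" part: a new pattern has to be generated for every repetition count and separator.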
So, in short: I have to use regular expressions.
Trying to rephrase: I think the essence of the problem is this: is there a Python regex notation that, for example, uses curly-brace repetition and allows me to capture
'some string, another string, '
into
('some string', 'another string')
?
Hmmm, that probably narrows it down too far, but then, any way you do it is wrong :-D
Second attempt to rephrase: why do I not see the first string ('some string') in the result? Why does the regex produce a match (which indicates that there must be 2 of something) but return only 1 string (the second one)?
The problem remains the same even if I use non-numeric repetition, i.e. + instead of {2}:
>>> match('(?:(.*?), )+', 'some string, another string, ').groups()
('another string',)
Also, it is not the second string that is returned, but the last one:
>>> match('(?:(.*?), )+', 'some string, another string, third string, ').groups()
('third string',)
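The same observation as a small self-contained script, in case anyone wants to reproduce it. The helper name `repeated_groups` is made up for this example. It also returns the overall match, which shows that the regex did consume every repetition even though the group reports only the last one.

```python
import re

def repeated_groups(text):
    """Return (whole match, captured groups) for the repeated-group pattern."""
    m = re.match(r"(?:(.*?), )+", text)
    if m is None:
        return None
    # group(0) is the full matched text; groups() holds the capture groups.
    return m.group(0), m.groups()

for text in ("some string, another string, ",
             "some string, another string, third string, "):
    print(repeated_groups(text))
```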
Again, thanks for your help. It never ceases to amaze me how helpful peer review is while trying to figure out what I actually want to know...