Python regex: capturing several substrings that contain spaces

I am trying to grab substrings from a string that looks like

'some string, another string, ' 

I want the resulting match groups to be

 ('some string', 'another string') 

My current solution

 >>> from re import match
 >>> match(2 * '(.*?), ', 'some string, another string, ').groups()
 ('some string', 'another string')

works, but it is not practicable: what I am showing here is, of course, massively reduced in complexity compared to what I do in the real project, and I want to use only one "direct" (not computed) regex pattern. Unfortunately, my attempts have failed so far:

This does not match (the result is None), because {2} applies only to the space, not to the whole pattern:

 >>> match('.*?, {2}', 'some string, another string, ') 
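
A minimal sketch of why that attempt returns None: a quantifier binds only to the element immediately before it, here the single space, so the pattern demands a comma followed by two spaces. Wrapping the whole unit in a (non-capturing) group makes {2} repeat the unit instead. The variable names are illustrative only.

```python
import re

text = 'some string, another string, '

# {2} binds to the preceding space only: the pattern needs a comma plus two spaces.
space_only = re.match(r'.*?, {2}', text)
print(space_only)  # None

# Grouping the unit makes {2} repeat ".*?, " as a whole.
grouped = re.match(r'(?:.*?, ){2}', text)
print(grouped.group(0) == text)  # True
```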

Adding parentheses around the repeated pattern leaves the comma and the space in the result:

 >>> match('(.*?, ){2}', 'some string, another string, ').groups()
 ('another string, ',)

Adding another set of parentheses fixes that, but returns too much:

 >>> match('((.*?), ){2}', 'some string, another string, ').groups()
 ('another string, ', 'another string')

Adding a non-capturing modifier improves the result, but the first string is still missing:

 >>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
 ('another string',)

I feel like I'm close, but I really can't find the right way.

Can anyone help me? Any other approaches that I don't see?


Update after the first few answers:

First of all, thank you all, your help is greatly appreciated! :-)

As I said in the original post, I omitted a lot of complexity in my question in order to convey the actual underlying problem. For starters, in the project I'm working on, I parse a large number of files (currently tens of thousands per day) in a number (currently 5, soon ~25, possibly hundreds later) of different line-based formats. There are also XML, JSON, binary and some other data file formats, but let's not get sidetracked.

To handle the many file formats and exploit the fact that many of them are line-based, I created a somewhat generic Python module that loads one file after another, applies a regex to each line, and returns a large data structure with the matches. This module is a prototype; the production version will require a C++ version for performance reasons, which will be connected via Boost::Python and will probably add the subject of regex dialects to the list of difficulties.

In addition, there are not 2 repetitions, but a number varying between zero and 70 (or so), the comma is not always a comma, and despite what I said originally, some parts of the regex pattern will have to be computed at runtime; let me just say that I have a reason to try to reduce the "dynamic" amount and have as much "fixed" pattern as possible.
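
As a rough sketch of what such a runtime-computed pattern could look like when both the separator and the repetition count vary: the helper name build_pattern, its parameters, and the sample input are hypothetical and not taken from the real project.

```python
import re

def build_pattern(separator, count):
    """Hypothetical helper: one lazy capture per field, each followed by the separator."""
    # re.escape keeps separators such as '.' or '|' from being read as regex syntax.
    unit = '(.*?)' + re.escape(separator)
    return re.compile(unit * count)

pattern = build_pattern('; ', 3)
fields = pattern.match('first item; second item; third; ').groups()
print(fields)  # ('first item', 'second item', 'third')
```

Only the count and the separator are dynamic here; the per-field unit stays fixed, which is in the spirit of keeping the "dynamic" amount small.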

So, in a word: I have to use regular expressions.


Trying to rephrase: I think the essence of the problem is this: is there a Python regex notation that, for example, involves curly-brace repetition and allows me to capture

 'some string, another string, ' 

into

 ('some string', 'another string') 

?

Hmmm, that probably narrows it down too far - but then, whichever way you do it, it's wrong :-D


Second attempt to rephrase: Why do I not see the first string ('some string') in the result? Why does the regex produce a match (indicating that there have to be 2 of something), but return only 1 string (the second one)?

The problem remains the same even if I use non-numeric repetition, i.e. using + instead of {2}:

 >>> match('(?:(.*?), )+', 'some string, another string, ').groups()
 ('another string',)

Also, it is not the second string that gets returned, but the last one:

 >>> match('(?:(.*?), )+', 'some string, another string, third string, ').groups()
 ('third string',)

Again, thanks for your help. It never ceases to amaze me how helpful peer review is when trying to figure out what I actually want to know...

0
6 answers

To summarize, it seems like I'm already using the best solution by building a regular expression pattern in a dynamic way:

 >>> from re import match
 >>> match(2 * '(.*?), ', 'some string, another string, ').groups()
 ('some string', 'another string')

The

 2 * '(.*?), '

is what I mean by dynamic. The alternative approach

 >>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
 ('another string',)

fails to return the desired result because (as Glenn and Alan kindly explained)

when matching, the captured content is overwritten on each repetition of the capturing group
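
The overwriting can be seen directly in a small sketch, together with one possible workaround that keeps a fixed pattern: let the repeated pattern check the overall shape, then collect the individual items with findall on the matched span. The two-step split is an assumption about what would be acceptable here, not something proposed in the answers.

```python
import re

text = 'some string, another string, third string, '

# A capturing group inside a repetition keeps only its last iteration.
m = re.match(r'(?:(.*?), )+', text)
last_only = m.groups()
print(last_only)  # ('third string',)

# Workaround sketch: reuse the span matched by the repeated pattern
# and extract every item from it with findall.
items = tuple(re.findall(r'(.*?), ', m.group(0)))
print(items)  # ('some string', 'another string', 'third string')
```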

Thanks for all your help! :-)

-1

Unless there is much more to this problem than you have explained, I do not see the point in using regular expressions. It is very simple to deal with using basic string methods:

 [s.strip() for s in mys.split(',') if s.strip()] 

Or if it should be a tuple:

 tuple(s.strip() for s in mys.split(',') if s.strip()) 

The code is more readable too. Please tell me if this does not fit your case.


EDIT: Well, there is indeed more to this problem than it seemed initially. I am leaving this here for historical purposes, though. (I guess I'm not "disciplined" :))

+5

As described, I think this regex works fine:

 import re
 thepattern = re.compile("(.+?)(?:,|$)")  # lazy non-empty match
 thepattern.findall("a, b, asdf, d")      # until comma or end of line
 # Result:
 Out[19]: ['a', ' b', ' asdf', ' d']

The key point here is findall, not match. The wording of your question suggests that you prefer match, but it is not the right tool for the job here: it is designed to return exactly one string for each corresponding group ( ) in the regex. Since your number of strings is variable, the right approach is to use either findall or split.
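
For completeness, the split alternative could look like the following sketch. The separator pattern (a comma plus optional whitespace) is an assumption based on the sample string from the question.

```python
import re

text = 'some string, another string, third string, '

# Split on a comma plus optional whitespace (assumed separator).
# The trailing separator leaves an empty last piece, which is filtered out.
parts = [piece for piece in re.split(r',\s*', text) if piece]
print(parts)  # ['some string', 'another string', 'third string']
```

Unlike the findall example above, this variant also drops the leading space of each item, because the whitespace is consumed as part of the separator.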

If this is not what you need, please make the question more specific.

Edit: And if you must use tuples rather than lists:

 tuple(Out[19])
 # Result
 Out[20]: ('a', ' b', ' asdf', ' d')
+4
 import re
 regex = " *((?:[^, ]| +[^, ])+) *, *((?:[^, ]| +[^, ])+) *, *"
 print(re.match(regex, 'some string, another string, ').groups())
 # ('some string', 'another string')
 print(re.match(regex, ' some string, another string, ').groups())
 # ('some string', 'another string')
 print(re.match(regex, ' some string , another string, ').groups())
 # ('some string', 'another string')
+2

No offense, but you obviously have a lot to learn about regular expressions, and what you are going to learn, in the long run, is that regular expressions cannot handle this job. I am sure that this particular task is feasible with regular expressions, but what then? You say that you have potentially hundreds of different file formats to parse! You even mentioned JSON and XML, which are fundamentally incompatible with regular expressions.

Do yourself a favor: forget about regular expressions and learn pyparsing. Or skip Python completely and use a standalone parser generator such as ANTLR. In either case, you will probably find that grammars for most of your file formats have already been written.

+1

I think the essence of the problem boils down to: is there a Python regex notation that, for example, involves curly-brace repetition and allows me to capture 'some string, another string, '?

I don’t think there is such a notation.

But regular expressions are not only a matter of NOTATION, that is, the RE string used to define the regular expression. They are also a matter of TOOLS, that is, functions.

Unfortunately, I cannot use findall, as the string from the initial question is only part of the problem; the real string is much longer, so findall only works if I do multiple regex findalls / matches / searches.

You should provide more information without delay: we could understand more quickly what the constraints are. Because, in my opinion, for your problem as it has been stated, findall() is indeed fine:

 import re
 for line in ('string one, string two, ',
              'some string, another string, third string, ',
              # the following two lines are only one string
              'Topaz, Turquoise, Moss Agate, Obsidian, '
              'Tigers-Eye, Tourmaline, Lapis Lazuli, '):
     print(re.findall('(.+?), *', line))

Result

 ['string one', 'string two']
 ['some string', 'another string', 'third string']
 ['Topaz', 'Turquoise', 'Moss Agate', 'Obsidian', 'Tigers-Eye', 'Tourmaline', 'Lapis Lazuli']

Now, since you have "omitted a lot of complexity" in your question, findall() may turn out to be inadequate for that complexity. In that case finditer() can be used instead, as it allows more flexibility in selecting the groups of a match:

 import re
 for line in ('string one, string two, ',
              'some string, another string, third string, ',
              # the following two lines are only one string
              'Topaz, Turquoise, Moss Agate, Obsidian, '
              'Tigers-Eye, Tourmaline, Lapis Lazuli, '):
     print([mat.group(1) for mat in re.finditer('(.+?), *', line)])

gives the same result, and can be made more elaborate by writing another expression in place of mat.group(1).
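
As one illustration of such an alternative expression, each match object could yield the position of the item along with its text. This is only a sketch; which attributes are useful depends on the real data, and the name records is made up for the example.

```python
import re

line = 'some string, another string, third string, '

# Each match object exposes more than the captured text,
# for example the start offset of group 1 within the line.
records = [(mat.start(1), mat.group(1))
           for mat in re.finditer(r'(.+?), *', line)]
print(records)
# [(0, 'some string'), (13, 'another string'), (29, 'third string')]
```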

0
