Regex and pattern sequences?

Question

Is there a way to match a pattern (e\d\d) several times, capturing each occurrence in its own group? For example, given the string ..

 blah.s01e24e25

.. I want to get four groups:

 1 -> blah
 2 -> 01
 3 -> 24
 4 -> 25

The obvious regex to use (in Python):

 import re
 re.match(r"(\w+).s(\d+)e(\d+)e(\d+)", "blah.s01e24e25").groups()

.. but I also want to match one of the following:

 blah.s01e24
 blah.s01e24e25e26

It seems you cannot do (e\d\d)+, or rather you can, but it only captures the last occurrence:

 >>> re.match(r"(\w+).s(\d+)(e\d\d){2}", "blah.s01e24e25e26").groups()
 ('blah', '01', 'e25')
 >>> re.match(r"(\w+).s(\d+)(e\d\d){3}", "blah.s01e24e25e26").groups()
 ('blah', '01', 'e26')

I want to do this in a single regex, because I have several patterns for matching TV episode file names and I don't want to duplicate each expression to handle multiple episodes:

 \w+\.s(\d+)e(\d+) # matches blah.s01e01
 \w+\.s(\d+)e(\d+)e(\d+) # matches blah.s01e01e02
 \w+\.s(\d+)e(\d+)e(\d+)e(\d+) # matches blah.s01e01e02e03

 \w+ - \d+x\d+ # matches blah - 01x01
 \w+ - \d+x\d+x\d+ # matches blah - 01x01x02
 \w+ - \d+x\d+x\d+x\d+ # matches blah - 01x01x02x03

.. and so on, for many other patterns.

One more thing that complicates matters is that I want to store these regular expressions in a configuration file, so a solution that uses several regular expressions and function calls is not what I'm after. If a single regex is not possible, I'll just let the user add simple regular expressions.

Basically, is there a way to capture a repeating pattern using a regular expression?

+4

python regex sequences

dbr Jun 27 '09 at 19:45

5 answers

Do it in two steps: one regex to find the whole season/episode run, then another to split out the numbers:

 import re

 def get_pieces(s):
     # Error checking omitted!
     whole_match = re.search(r'\w+\.(s\d+(?:e\d+)+)', s)
     return re.findall(r'\d+', whole_match.group(1))

 print(get_pieces(r"blah.s01e01"))
 print(get_pieces(r"blah.s01e01e02"))
 print(get_pieces(r"blah.s01e01e02e03"))

 # prints:
 # ['01', '01']
 # ['01', '01', '02']
 # ['01', '01', '02', '03']

+5

Richiehindle Jun 27 '09 at 19:53

The number of captured groups equals the number of pairs of capturing parentheses in the pattern, no matter how often a group repeats. Look at findall or finditer to solve your problem.
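As a minimal sketch of the finditer idea (the helper name parse_episodes and the exact patterns are assumptions for illustration, not something from the question): match the show name and season once, then iterate over every e<digits> chunk in the rest of the string.

```python
import re

# Hypothetical helper: the name and both patterns are illustrative assumptions.
PREFIX = re.compile(r"(\w+)\.s(\d+)")   # show name and season
EPISODE = re.compile(r"e(\d+)")         # one episode marker

def parse_episodes(filename):
    head = PREFIX.match(filename)
    if head is None:
        return None
    # Scan only the text after the season number, so an "e" followed by
    # digits inside the show name is not picked up.
    rest = filename[head.end():]
    episodes = [m.group(1) for m in EPISODE.finditer(rest)]
    return head.group(1), head.group(2), episodes

print(parse_episodes("blah.s01e24"))
print(parse_episodes("blah.s01e24e25e26"))
```

Note that this scans the whole remainder of the name, so anything later in the file name that looks like e<digits> would also be collected; a stricter version would anchor the episode run first.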

+1

Rorick Jun 27 '09 at 19:56

Non-capturing parentheses: (?:asdfasdg)

which do not have to appear (optional): (?:adsfasdf)?

 c = re.compile(r"""(\w+).s(\d+)
                    (?:
                         e(\d+)
                         (?:
                              e(\d+)
                         )?
                    )?
                """, re.X)

or

 c = re.compile(r"""(\w+).s(\d+)(?:e(\d+)(?:e(\d+))?)?""", re.X)
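To see what the nested optional groups return, here is a small sketch. It assumes the dot is escaped (`\.`) and uses a hypothetical helper, captured, that drops the groups that did not take part in the match.

```python
import re

# Same shape as the pattern above, with the dot escaped (an assumption).
pattern = re.compile(r"(\w+)\.s(\d+)(?:e(\d+)(?:e(\d+))?)?")

def captured(filename):
    """Hypothetical helper: return only the groups that actually matched."""
    m = pattern.match(filename)
    if m is None:
        return None
    # Optional groups that did not participate are reported as None.
    return [g for g in m.groups() if g is not None]

print(captured("blah.s01e24"))
print(captured("blah.s01e24e25"))
print(captured("blah.s01e24e25e26"))
```

The limitation is that the pattern only goes as deep as it is nested: with two levels, a third episode such as e26 is silently left out of the match, so every extra episode needs another nested (?: ... )? level.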

+1

Adrian panasiuk Jun 27 '09 at 20:18

Maybe something like that?

 def episode_matcher(filename):
     m1= re.match(r"(?i)(.*?)\.s(\d+)((?:e\d+)+)", filename)
     if m1:
         m2= re.findall(r"\d+", m1.group(3))
         return m1.group(1), m1.group(2), m2
     # auto return None here

 >>> episode_matcher("blah.s01e02")
 ('blah', '01', ['02'])
 >>> episode_matcher("blah.S01e02E03")
 ('blah', '01', ['02', '03'])

0

tzot Jun 28 '09 at 1:11

dbr · Accepted Answer · 2009-06-27T20:32:22+0000

After thinking about the problem, I think I have a simpler solution using named groups.

The simplest regular expression that the user (or I) can use is:

 (\w+)\.s(\d+)e(\d+)

The file name parsing class will take the first group as the name of the show, the second as the season number, and the third as the episode number. This covers most files.

I will also allow a few different named groups for these:

 (?P<showname>\w+)\.s(?P<seasonnumber>\d+)e(?P<episodenumber>\d+)

To support multiple episodes, I will support two named groups, something like startingepisodenumber and endingepisodenumber, to handle things like showname.s01e01-03:

 (?P<showname>\w+)\.s(?P<seasonnumber>\d+)e(?P<startingepisodenumber>\d+)-(?P<endingepisodenumber>\d+)
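A rough sketch of how the parsing code might turn the two range groups into a list of episode numbers. The function name expand_range is hypothetical, and the pattern assumes file names of the form showname.s01e01-03.

```python
import re

# Assumed pattern for names like showname.s01e01-03 (illustrative only).
RANGE_PATTERN = re.compile(
    r"(?P<showname>\w+)\.s(?P<seasonnumber>\d+)"
    r"e(?P<startingepisodenumber>\d+)-(?P<endingepisodenumber>\d+)"
)

def expand_range(filename):
    """Hypothetical helper: return (show, season, [episode numbers])."""
    m = RANGE_PATTERN.match(filename)
    if m is None:
        return None
    start = int(m.group("startingepisodenumber"))
    end = int(m.group("endingepisodenumber"))
    # The range is inclusive at both ends: e01-03 means episodes 1, 2 and 3.
    return m.group("showname"), int(m.group("seasonnumber")), list(range(start, end + 1))

print(expand_range("showname.s01e01-03"))
```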

Finally, allow named groups with names matching episodenumber\d+ (episodenumber1, episodenumber2, etc.):

 (?P<showname>\w+)\.
 s(?P<seasonnumber>\d+)
 e(?P<episodenumber1>\d+)
 e(?P<episodenumber2>\d+)
 e(?P<episodenumber3>\d+)

This still means duplicating patterns for different numbers of e01 parts, but there should never be a file with two non-consecutive episodes (for example, show.s01e01e03e04), so the startingepisodenumber/endingepisodenumber groups should cover this. For any odd cases users run into, they can use the episodenumber\d+ group names.
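One way the parsing code could collect the numbered groups, as a sketch only: look through groupdict() for names of the form episodenumber<N> and sort them by N. The helper episode_numbers is hypothetical, and making the second and third groups optional is my own assumption, added so that one pattern covers one to three episodes.

```python
import re

# Illustrative pattern; the optional (?: ... )? wrappers are an assumption.
PATTERN = re.compile(
    r"(?P<showname>\w+)\.s(?P<seasonnumber>\d+)"
    r"e(?P<episodenumber1>\d+)"
    r"(?:e(?P<episodenumber2>\d+))?"
    r"(?:e(?P<episodenumber3>\d+))?"
)

def episode_numbers(match):
    """Hypothetical helper: gather episodenumber<N> groups in order of N."""
    numbered = []
    for name, value in match.groupdict().items():
        key = re.fullmatch(r"episodenumber(\d+)", name)
        # Skip other groups and optional groups that did not match.
        if key and value is not None:
            numbered.append((int(key.group(1)), int(value)))
    return [episode for _, episode in sorted(numbered)]

print(episode_numbers(PATTERN.match("blah.s01e01e02")))
```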

This does not answer the question about repeating patterns, but it solves the problem that made me ask it! (I will still accept another answer that shows how to match s01e23e24...e27 in one regular expression, if someone works it out!)
