Extract string contents in parentheses

Question

I have the following line:

string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

I would like to create a list of tuples in the form [(actor_name, character_name),...] as follows:

 [(Will Ferrell, Nick Halsey), (Rebecca Hall, Samantha), (Michael Pena, Frank Garcia)]

I am currently doing this in a hack-ish way, by splitting on the ( and using .rstrip(')'), for example:

 for item in string.split(','): item.rstrip(')').split('(')

Is there a better, more reliable way to do this? Thanks.

+2

python

David542 Aug 10 '11 at 1:12

3 answers

Good place for regular expressions:

    >>> import re
    >>> pat = r"([^,\(]*)\((.*?)\)"
    >>> re.findall(pat, "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)")
    [('Will Ferrell ', 'Nick Halsey'), (' Rebecca Hall ', 'Samantha'), (' Michael Pena ', 'Frank Garcia')]
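The names in that result still carry the spaces around them. A minimal sketch of trimming them, assuming the same input string; the names `pairs`, `actor`, and `role` are illustrative and not part of the answer:

```python
import re

string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

# Same pattern as in the answer, written as a raw string.
pat = r"([^,\(]*)\((.*?)\)"

# Strip the white space that the first group picks up around each actor name.
pairs = [(actor.strip(), role.strip()) for actor, role in re.findall(pat, string)]
print(pairs)
```

Unpacking each tuple into two names keeps the comprehension readable and fails loudly if the pattern ever gains a third group.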

+2

Benjamin Peterson Aug 10 '11 at 1:24

A somewhat more explicit answer than the others; I think it meets your needs:

    import re

    # two words, a space, then two words inside parentheses
    regex = re.compile(r'([a-zA-Z]+ [a-zA-Z]+) \(([a-zA-Z]+ [a-zA-Z]+)\)')
    actor_character = regex.findall(string)

I find this a little ugly, but, as I said, more explicit.
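One thing to watch for: being explicit here means the pattern requires exactly two words on each side, so a one-word character name such as Samantha is skipped. The sketch below checks that against the question's string and tries a loosened variant. The variant and the names `words`, `loose`, `strict_pairs`, and `loose_pairs` are assumptions for illustration, not part of the answer:

```python
import re

string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

# The pattern from the answer: two words, then two words in parentheses.
strict = re.compile(r'([a-zA-Z]+ [a-zA-Z]+) \(([a-zA-Z]+ [a-zA-Z]+)\)')
strict_pairs = strict.findall(string)

# Hypothetical loosened variant: one or more space-separated words on each side.
words = r'[a-zA-Z]+(?: [a-zA-Z]+)*'
loose = re.compile(r'(%s) \((%s)\)' % (words, words))
loose_pairs = loose.findall(string)

print(len(strict_pairs), len(loose_pairs))  # 2 3
```

The loosened variant still rejects names containing anything other than ASCII letters and single spaces (hyphens, apostrophes, accented letters), so it is only suitable when the input is known to be that simple.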

0

djhoese Aug 10 '11 at 1:31

steveha · Accepted Answer · 2011-08-10T01:39:37+0000

    import re

    string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

    pat = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(?:,\s*|$)')
    lst = [(t[0].strip(), t[1].strip()) for t in pat.findall(string)]

The compiled pattern is a bit tricky. It is a raw string, to make the backslashes less insane. Read left to right, it means: start a match group; match anything that is not a '(' character, any number of times as long as it is at least once; close the match group; match any white space (including none); match a literal '(' character; start another match group; match anything that is not a ')' character, any number of times as long as it is at least once; close the match group; match a literal ')' character; then match any white space (including none); then something really tricky. The really tricky part is a grouping that does not form a match group. Instead of starting with '(' and ending with ')', it starts with "(?:" and then ends with ')'. I used this grouping so that I could put in a vertical bar to allow two alternative patterns: either a comma followed by any amount of white space, or the end of the string (the "$" character).

Then I used pat.findall() to find all the places inside the string that match the pattern; because the pattern has two match groups, it automatically returns a tuple for each match. I put this in a list comprehension and called .strip() on each item to remove the surrounding white space.
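To see why the "(?:" grouping matters, here is a small sketch comparing the pattern with and without it. The shortened sample text and the names `two_groups` and `three_groups` are illustrative assumptions:

```python
import re

text = "Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

# Trailing group is non-capturing: findall() yields one 2-tuple per match.
two_groups = re.findall(r'([^(]+)\(([^)]+)\)(?:,\s*|$)', text)

# Same pattern with a capturing trailing group: each tuple gains a third item
# holding the separator (or an empty string at the end of the text).
three_groups = re.findall(r'([^(]+)\(([^)]+)\)(,\s*|$)', text)

print(two_groups)
print(three_groups)
```

With the capturing version, the list comprehension would have to discard the third item by hand, which is why the non-capturing form is the more convenient choice here.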

Of course, we can make the regular expression even more complicated and have it return names that already have the white space stripped off. The regular expression gets really hairy, though, so we will use one of the coolest features in Python regular expressions: "verbose" mode, where you can spread a pattern across many lines and add comments as you like. We use a raw triple-quoted string, so the backslashes are convenient and the multiple lines are convenient. Here you go:

    import re

    s_pat = r'''
    \s*         # any amount of white space
    ([^( \t]    # start match group; match one char that is not a '(' or space or tab
    [^(]*       # match any number of non '(' characters
    [^( \t])    # match one char that is not a '(' or space or tab; close match group
    \s*         # any amount of white space
    \(          # match an actual required '(' char (not in any match group)
    \s*         # any amount of white space
    ([^) \t]    # start match group; match one char that is not a ')' or space or tab
    [^)]*       # match any number of non ')' characters
    [^) \t])    # match one char that is not a ')' or space or tab; close match group
    \s*         # any amount of white space
    \)          # match an actual required ')' char (not in any match group)
    \s*         # any amount of white space
    (?:,|$)     # non-match group: either a comma or the end of a line
    '''

    pat = re.compile(s_pat, re.VERBOSE)
    lst = pat.findall(string)

Man, that really wasn't worth the effort.
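A detail of the verbose pattern that is easy to miss: verbose mode ignores white space in the pattern except inside a character class, which is why the space in `[^( \t]` still counts. A minimal sketch of that rule, using toy patterns and names (`outside`, `inside`) that are not part of the answer:

```python
import re

# Outside a character class, verbose mode drops the space, so this is just "ab".
outside = re.compile(r'a b', re.VERBOSE)

# Inside a character class the space stays literal, so this matches one space.
inside = re.compile(r'[ ]', re.VERBOSE)

matches_ab = outside.fullmatch('ab') is not None      # True
matches_a_b = outside.fullmatch('a b') is not None    # False
matches_space = inside.fullmatch(' ') is not None     # True
```

To match a literal space outside a character class in verbose mode, escape it with a backslash or use `\s`.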

Also, the patterns above preserve the white space inside the names. You can easily normalize the white space, to make sure it is 100% consistent, by splitting on white space and rejoining with single spaces.

    import re

    string = ' Will Ferrell ( Nick\tHalsey ) , Rebecca Hall (Samantha), Michael\fPena (Frank Garcia)'

    pat = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(?:,\s*|$)')

    def nws(s):
        """Normalize white space: replace every run of white space with a single space."""
        return " ".join(s.split())

    lst = [tuple(nws(item) for item in t) for t in pat.findall(string)]
    print(lst)
    # prints: [('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Michael Pena', 'Frank Garcia')]

Now the string has silly white space in it: multiple spaces, a tab, and even a form feed ("\f"). The code above cleans it up so that the words in each name are separated by a single space.
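If you would rather stay within the re module, the same normalization can be sketched with re.sub(). The function name `nws_sub` and the sample value are assumptions for illustration; on this input it gives the same result as splitting and rejoining:

```python
import re

def nws_sub(s):
    """Collapse each run of white space to one space, then trim both ends."""
    return re.sub(r'\s+', ' ', s).strip()

messy = ' Michael\fPena \t'
cleaned = nws_sub(messy)
print(repr(cleaned))  # 'Michael Pena'
```

The split-and-join version needs no regular expression at all, so it is probably the simpler choice unless a pattern is already being compiled for other reasons.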
