Python-regex, what's going on here?

I recently got a Python book with a chapter on regular expressions, and there is a section of code in it that I cannot understand. Can someone explain what is happening here? (This section is about regex groups.)

 >>> import re
 >>> my_regex = r'(?P<zip>Zip:\s*\d\d\d\d\d)\s*(State:\s*\w\w)'
 >>> addrs = "Zip: 10010 State: NY"
 >>> y = re.search(my_regex, addrs)
 >>> y.groupdict('zip')
 {'zip': 'Zip: 10010'}
 >>> y.group(2)
 'State: NY'
+6
python regex
6 answers

regex definition:

 (?P<zip>...) 

Creates a named group called zip.

 Zip:\s* 

Matches the literal text "Zip:" followed by zero or more whitespace characters.

 \d 

Matches a single digit.

 \w 

Matches a single word character ([A-Za-z0-9_] for ASCII text).

 y.groupdict('zip') 

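The groupdict method returns a dictionary with the named groups as keys and their matched text as values. The argument is not a group selector: it is the default value used for any named group that did not take part in the match. The result here contains only zip because zip is the only named group in the pattern.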

 y.group(2) 

Returns the text matched by the second group, which is the unnamed group "(...)".
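To illustrate the groupdict point, here is a minimal sketch using the pattern from the question. The second pattern, with the optional plus4 group, is a hypothetical example that is not in the original post; it is only there to show what the argument does when a named group does not participate in the match.

```python
import re

my_regex = r'(?P<zip>Zip:\s*\d\d\d\d\d)\s*(State:\s*\w\w)'
y = re.search(my_regex, "Zip: 10010 State: NY")

# The argument is a default for unmatched named groups, not a group selector,
# so both calls return the same dictionary for this match.
with_arg = y.groupdict('zip')
without_arg = y.groupdict()
print(with_arg)     # {'zip': 'Zip: 10010'}
print(without_arg)  # {'zip': 'Zip: 10010'}

# Hypothetical pattern with an optional named group, to show the default in use.
optional = re.search(r'(?P<zip>\d{5})(?P<plus4>-\d{4})?', "10010")
filled = optional.groupdict('n/a')
print(filled)       # {'zip': '10010', 'plus4': 'n/a'}
```

The unnamed State group never appears in the dictionary; it is only reachable by number.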

Hope this helps.

+8

The search method returns a match object describing where and how your regular expression matched the string, or None if it did not match.

groupdict returns a dictionary of groups, where the keys are the names of the groups defined with (?P<name>...). Here, name is the name of the group.

group returns the text matched by one or more groups. "State: NY" is group(2). group(0) is the entire match, and group(1) is "Zip: 10010".
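The numbering can be checked directly. This is a small sketch that reuses the pattern and input from the question; the variable names are only for illustration.

```python
import re

my_regex = r'(?P<zip>Zip:\s*\d\d\d\d\d)\s*(State:\s*\w\w)'
y = re.search(my_regex, "Zip: 10010 State: NY")

whole = y.group(0)       # the entire match
first = y.group(1)       # first pair of parentheses (also named "zip")
second = y.group(2)      # second pair of parentheses
both = y.group(1, 2)     # several groups at once come back as a tuple
all_groups = y.groups()  # every group except group 0, as a tuple

print(whole)       # Zip: 10010 State: NY
print(first)       # Zip: 10010
print(second)      # State: NY
print(both)        # ('Zip: 10010', 'State: NY')
print(all_groups)  # ('Zip: 10010', 'State: NY')
```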

This was a relatively simple question. I just googled the method documentation and found this page. Google is your friend.

+2

 import re

 # r'...' makes this a raw string; otherwise you'd need to double the backslashes
 # ( ... )       groups something
 # (?P<zip>...)  the ?P<zip> right after the opening bracket names the group "zip"
 # *             means "zero or more of the previous item"
 # \s            is whitespace
 # \d            is a digit, equivalent to [0-9] for ASCII text
 # \w            is a word character (alphanumeric or underscore)
 my_regex = r'(?P<zip>Zip:\s*\d\d\d\d\d)\s*(State:\s*\w\w)'
 addrs = "Zip: 10010 State: NY"
 # Run the search on the string
 y = re.search(my_regex, addrs)
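The raw-string remark is easy to verify. The sketch below compares a raw string with its double-backslash equivalent; the \b word-boundary pattern is an extra illustration that does not come from the original answer.

```python
import re

raw = r'\d\d'
escaped = '\\d\\d'
same = raw == escaped
print(same)      # True: both are the four characters \ d \ d
print(len(raw))  # 4

# Without the r prefix, '\b' is a single backspace character.
# With it, r'\b' is a backslash plus "b", which the regex engine
# reads as a word boundary.
plain_len = len('\b')
raw_len = len(r'\b')
print(plain_len, raw_len)  # 1 2

m = re.search(r'\bNY\b', "State: NY")
print(m.group())  # NY
```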
+1

The syntax (?P<identifier>match) is Python's way of writing named capture groups. It lets you access what the group matched by name, not just by its sequential number.

Since the first set of brackets is named zip, you can access its match using the match object's groupdict method to get the {identifier: match} pair. Or you can use y.group('zip') if you are only interested in the matched text (which usually makes sense, since you already know the identifier). You can also access the same match by its number (1). The next group is unnamed, so the only way to access it is by its number.
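As a rough sketch of those access paths, again with the pattern from the question: the y['zip'] form assumes Python 3.6 or later, and the lookup of 'state' is included only to show that the second group has no name.

```python
import re

my_regex = r'(?P<zip>Zip:\s*\d\d\d\d\d)\s*(State:\s*\w\w)'
y = re.search(my_regex, "Zip: 10010 State: NY")

by_name = y.group('zip')  # named access
by_number = y.group(1)    # same group, by position
by_index = y['zip']       # subscript form, Python 3.6+
print(by_name)            # Zip: 10010
print(by_name == by_number == by_index)  # True

# The second group has no name, so only its number works.
state = y.group(2)
print(state)              # State: NY

try:
    y.group('state')
    error_type = None
except IndexError as exc:
    error_type = type(exc).__name__
print(error_type)         # IndexError
```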

0

Adding to the previous answers: in my opinion, you'd better choose one type of group (named or unnamed) and stick to it. I usually use named groups. For example:

 >>> import re
 >>> my_regex = r'(?P<zip>Zip:\s*\d\d\d\d\d)\s*(?P<state>State:\s*\w\w)'
 >>> addrs = "Zip: 10010 State: NY"
 >>> y = re.search(my_regex, addrs)
 >>> print(y.groupdict())
 {'zip': 'Zip: 10010', 'state': 'State: NY'}
0

strfriend is your friend:

http://strfriend.com/vis?re=(Zip%3A\s*\d\d\d\d\d)\s*(State%3A\s*\w\w)

EDIT: Why the hell does it make the entire line a link in the actual post, but not in the preview?

0