Nested text in regular expressions

Question

I struggle with regular expressions. I'm having trouble wrapping my head around matching text that is nested inside similar text. Perhaps you can help me sort out my thoughts.

Here is an example of a test line:

message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }

I want to extract each name (e.g. msgName, mn2) together with everything that follows it up to the next message, to get a list like this:

    msgName
    { stuff { innerStuff } more stuff }
    mn2
    { junk }

I'm having trouble being either too greedy or not greedy enough: I want to keep the inner brackets intact but still split the top-level messages.

Here is one program:

    import re

    text = 'message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }'
    # Greedy (.*) runs to the last closing brace in the whole text
    messagePattern = re.compile('message (.*?) {(.*)}', re.DOTALL)
    messageList = messagePattern.findall(text)

    print("messages:\n")
    count = 0
    for message, msgDef in messageList:
        count = count + 1
        print(str(count))
        print(message)
        print(msgDef)

It produces:

    messages:

    1
    msgName
     stuff { innerStuff } more stuff }
     message mn2 { junk

Here is my next attempt, which makes the inner part non-greedy:

    import re

    text = 'message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }'
    # Non-greedy (.*?) stops at the first closing brace
    messagePattern = re.compile('message (.*?) {(.*?)}', re.DOTALL)
    messageList = messagePattern.findall(text)

    print("messages:\n")
    count = 0
    for message, msgDef in messageList:
        count = count + 1
        print(str(count))
        print(message)
        print(msgDef)

It produces:

    messages:

    1
    msgName
     stuff { innerStuff
    2
    mn2
     junk

So I'm losing } more stuff }

I've really hit a mental block. Can someone point me in the right direction? I can't work out how to handle text in nested brackets. A suggestion for a working regular expression, or a simpler example of dealing with nested similar text, would be helpful.

python python-2.7 regex

Xyz Apr 29 '16 at 12:40

1 answer

Wiktor Stribiżew · Accepted Answer · 2016-04-29T13:10:51+0000

If you can use the PyPI regex module, you can make use of its subroutine support:

    >>> import regex
    >>> reg = regex.compile(r"(\w+)\s*({(?>[^{}]++|(?2))*})")
    >>> s = "message msgName { stuff { innerStuff } } \n message mn2 { junk }"
    >>> print(reg.findall(s))
    [('msgName', '{ stuff { innerStuff } }'), ('mn2', '{ junk }')]

The regular expression - (\w+)\s*({(?>[^{}]++|(?2))*}) - matches:

(\w+) - Group 1, matching 1 or more word characters (letters, digits or underscores)
\s* - 0+ whitespace characters
({(?>[^{}]++|(?2))*}) - Group 2, matching a {, then 0 or more repetitions of either characters other than { and } or another balanced {...} (via the subroutine call (?2), which recurses the whole Group 2 subpattern), and then the closing }.
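If installing the regex module is not an option, the balancing that (?2) performs can be imitated by counting brace depth by hand. This is a different technique from the regular expression above, shown only as a minimal sketch: split_messages is a hypothetical helper name, and it assumes braces never appear inside strings or comments.

```python
import re

def split_messages(text):
    """Return (name, body) pairs where body is a balanced {...} block.

    Hypothetical helper: uses re only to find "name {", then counts
    brace depth manually instead of relying on regex recursion.
    """
    header = re.compile(r"(\w+)\s*\{")
    results = []
    pos = 0
    while True:
        m = header.search(text, pos)
        if m is None:
            break
        start = m.end() - 1          # index of the opening brace
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:       # matching close for the opening brace
                    results.append((m.group(1), text[start:i + 1]))
                    pos = i + 1
                    break
        else:
            break                    # unbalanced braces: stop rather than guess
    return results

s = "message msgName { stuff { innerStuff } } \n message mn2 { junk }"
print(split_messages(s))
# [('msgName', '{ stuff { innerStuff } }'), ('mn2', '{ junk }')]
```

Unlike the one-level re pattern, this handles any nesting depth, at the cost of no longer being a single expression.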

If there is only one level of nesting, the standard re module can also be used, with

 (\w+)\s*{[^{}]*(?:{[^{}]*}[^{}]*)*}

See the regex demo.

(\w+) - Group 1, matching 1+ word characters
\s* - 0+ whitespace characters
{ - opening bracket
[^{}]* - 0+ characters except { and }
(?:{[^{}]*}[^{}]*)* - 0+ sequences:
- { - opening bracket
- [^{}]* - 0+ characters except { and }
- } - closing bracket
- [^{}]* - 0+ characters except { and }
} - closing bracket
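As a rough sketch, the one-level pattern could be used with re like this. The extra capturing group around the braced part is an addition for illustration (the pattern as written captures only the name), and the braces are escaped here purely for readability; the variable names are assumptions.

```python
import re

# One-level-nesting pattern from above, with an added second group
# around the braced part so that findall also returns the body.
pattern = re.compile(r"(\w+)\s*(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})")

text = "message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }"

pairs = pattern.findall(text)
for name, body in pairs:
    print(name)
    print(body)
# msgName
# { stuff { innerStuff } more stuff }
# mn2
# { junk }
```

With two or more levels of nesting inside a message, this pattern no longer matches the whole block, which is where the recursive regex version is needed.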
