Nested text in regular expressions

I struggle with regular expressions. I'm having trouble wrapping my head around matching text that is nested inside similar text. Perhaps you can help me sort out my thoughts.

Here is an example of a test line:

message msgName { stuff { innerStuff } } \n message mn2 { junk }

I want to extract each message name (e.g. msgName, mn2) and everything that follows it up to the next message, to get a list like this:

    msgName
    { stuff { innerStuff } more stuff }
    mn2
    { junk }

I'm having trouble with the pattern being either too greedy or not greedy enough: I want to keep the inner braces together but still split the top-level messages.

Here is one program:

    import re

    text = 'message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }'
    # Greedy (.*) in the second group runs on to the last closing brace in the text
    messagePattern = re.compile('message (.*?) {(.*)}', re.DOTALL)
    messageList = messagePattern.findall(text)
    print("messages:\n")
    count = 0
    for message, msgDef in messageList:
        count = count + 1
        print(str(count))
        print(message)
        print(msgDef)

It produces:

    messages:

    1
    msgName
     stuff { innerStuff } more stuff } 
     message mn2 { junk 

Here is my next attempt, which makes the inner group non-greedy:

    import re

    text = 'message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }'
    # Non-greedy (.*?) in the second group stops at the first closing brace
    messagePattern = re.compile('message (.*?) {(.*?)}', re.DOTALL)
    messageList = messagePattern.findall(text)
    print("messages:\n")
    count = 0
    for message, msgDef in messageList:
        count = count + 1
        print(str(count))
        print(message)
        print(msgDef)

It produces:

    messages:

    1
    msgName
     stuff { innerStuff 
    2
    mn2
     junk 

So I'm losing the trailing } more stuff } part of the first message.

I've really hit a mental block. Can someone point me in the right direction? I can't work out how to process text in nested braces. A working regular expression, or a simpler example of handling nested similar text, would be helpful.

1 answer

If you can use the PyPI regex module, you can take advantage of its subroutine (recursion) support:

    >>> import regex
    >>> reg = regex.compile(r"(\w+)\s*({(?>[^{}]++|(?2))*})")
    >>> s = "message msgName { stuff { innerStuff } } \n message mn2 { junk }"
    >>> print(reg.findall(s))
    [('msgName', '{ stuff { innerStuff } }'), ('mn2', '{ junk }')]

The regular expression - (\w+)\s*({(?>[^{}]++|(?2))*}) - matches:

  • (\w+) - Group 1: one or more word characters (letters, digits, underscores)
  • \s* - zero or more whitespace characters
  • ({(?>[^{}]++|(?2))*}) - Group 2: an opening {, then zero or more repetitions of either a run of characters other than { and } or another balanced {...} block (the subroutine call (?2) recurses into the whole Group 2 subpattern), then the closing }.
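If installing a third-party module is not an option, arbitrary nesting can also be handled without recursion in the pattern, by counting brace depth by hand. Note that this is a different technique from the recursive pattern above, and only a minimal sketch: the function name split_messages is made up for illustration, and it assumes every block starts with the literal word "message" as in the question.

```python
import re

def split_messages(text):
    """Return (name, body) pairs, where body is the balanced {...} block."""
    results = []
    # The regex only locates each "message <name> {" header.
    header = re.compile(r"message\s+(\w+)\s*\{")
    pos = 0
    while True:
        m = header.search(text, pos)
        if m is None:
            break
        start = m.end() - 1  # index of the opening brace
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    # Back at depth 0: this brace closes the block.
                    results.append((m.group(1), text[start:i + 1]))
                    pos = i + 1
                    break
        else:
            # Unbalanced braces: stop rather than guess where the block ends.
            break
    return results

text = "message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }"
print(split_messages(text))
```

The trade-off is more code in exchange for no extra dependency and no limit on nesting depth.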

If there is only one level of nesting, the built-in re module can also be used, with this pattern:

 (\w+)\s*{[^{}]*(?:{[^{}]*}[^{}]*)*} 

See the regex demo

  • (\w+) - Group 1: one or more word characters
  • \s* - zero or more whitespace characters
  • { - opening bracket
  • [^{}]* - 0+ characters except { and }
  • (?:{[^{}]*}[^{}]*)* - 0+ sequences:
    • { - opening bracket
    • [^{}]* - 0+ characters except { and }
    • } - closing bracket
    • [^{}]* - 0+ characters except { and }
  • } - closing bracket
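To try the one-level pattern against the text from the question, something like the following should work with the standard re module. One assumption on my part: I wrapped the brace part in a second capturing group so that findall returns the body as well as the name, and escaped the braces for readability. The pattern is otherwise the same.

```python
import re

# One-level-nesting pattern; the second capturing group around the
# braces is an addition so that findall also returns the body.
pattern = re.compile(r"(\w+)\s*(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})")

text = "message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }"
matches = pattern.findall(text)
for name, body in matches:
    print(name)
    print(body)
```

Keep in mind that this only copes with a single level of inner braces. With deeper nesting the pattern can no longer match the whole outer block, so it will silently pick up an inner block instead.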