Parsing a string for nested patterns

What would be the best way to do this.

Input string

<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust.</135_3></133_3> 

Expected Result

 {'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit \ using either camera now they are just sitting and collecting dust.':[133, 135], 'The other system worked for about 1 month': [116], 'on it then it started doing the same thing as the first one':[137] } 

which looks like a recursive search in a regex, but I can't figure out exactly how.

I can think of a tedious recursive function at the moment, but I feel that there should be a better way.

Related question: Can regular expressions be used to match nested patterns?

+3
source share
6 answers

Use expat or another XML parser; it is more explicit than anything else, given that you are still dealing with XML data.

However, note that XML element names cannot begin with a number, since your example has them.

Here's a parser that will do what you need, although you need to tweak it to combine repeating elements into a single dict key:

 from xml.parsers.expat import ParserCreate open_elements = {} result_dict = {} def start_element(name, attrs): open_elements[name] = True def end_element(name): del open_elements[name] def char_data(data): for element in open_elements: cur = result_dict.setdefault(element, '') result_dict[element] = cur + data if __name__ == '__main__': p = ParserCreate() p.StartElementHandler = start_element p.EndElementHandler = end_element p.CharacterDataHandler = char_data p.Parse(u'<_133_3><_135_3><_116_2>The other system worked for about 1 month</_116_2> got some good images <_137_3>on it then it started doing the same thing as the first one</_137_3> so then I quit using either camera now they are just sitting and collecting dust.</_135_3></_133_3>', 1) print result_dict 
+4
source

Take an XML parser, create it a DOM (Document Object Model), and then create a recursive algorithm that traverses all nodes, calls "text ()" on each node (which should give you text in the current node and all the children) and put it's like a key to a dictionary.

+4
source
 from cStringIO import StringIO from collections import defaultdict ####from xml.etree import cElementTree as etree from lxml import etree xml = "<e133_3><e135_3><e116_2>The other system worked for about 1 month</e116_2> got some good images <e137_3>on it then it started doing the same thing as the first one</e137_3> so then I quit using either camera now they are just sitting and collecting dust. </e135_3></e133_3>" d = defaultdict(list) for event, elem in etree.iterparse(StringIO(xml)): d[''.join(elem.itertext())].append(int(elem.tag[1:-2])) print(dict(d.items())) 

Output:

 {'on it then it started doing the same thing as the first one': [137], 'The other system worked for about 1 month': [116], 'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \ either camera now they are just sitting and collecting dust. ': [133, 135]} 
+2
source

I think grammar would be the best option here. I found a link with some information: http://www.onlamp.com/pub/a/python/2006/01/26/pyparsing.html

+1
source

Please note that you cannot really solve this by regular expression, since they do not have the expressive ability to provide the correct investment.

Take the following mini-language:

A certain number of "(" followed by the same number ")", regardless of the number.

You can make regex very easy to represent the superlanguage of this mini-language (where you do not apply the equality of the number of brackets and end brackets). You can also make the regexp very easy to represent any final sublanguage (where you limit yourself to the maximum nesting depth). But you can never imagine this exact language in regular expression.

So you will need to use grammar, yes.

+1
source

Here's an unreliable inefficient recursive regex:

 import re re_tag = re.compile(r'<(?P<tag>[^>]+)>(?P<content>.*?)</(?P=tag)>', re.S) def iterparse(text, tag=None): if tag is not None: yield tag, text for m in re_tag.finditer(text): for tag, text in iterparse(m.group('content'), m.group('tag')): yield tag, text def strip_tags(content): nested = lambda m: re_tag.sub(nested, m.group('content')) return re_tag.sub(nested, content) txt = "<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust. </135_3></133_3>" d = {} for tag, text in iterparse(txt): d.setdefault(strip_tags(text), []).append(int(tag[:-2])) print(d) 

Output:

 {'on it then it started doing the same thing as the first one': [137], 'The other system worked for about 1 month': [116], 'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \ either camera now they are just sitting and collecting dust. ': [133, 135]} 
0
source

All Articles