Pyparsing - parse xml comment

Question

I need to parse XML comments in a source file. Specifically, it is a C# file that uses the Microsoft /// convention.

From this I need to extract foobar; getting /// foobar would be acceptable too. (Note: this still doesn't work if the XML is all on one line.)

    testStr = """
    ///<summary>
    /// foobar
    ///</summary>
    """

Here is what I have:

    import pyparsing as pp

    _eol = pp.Literal("\n").suppress()
    _cPoundOpenXmlComment = pp.Suppress('///<summary>') + pp.SkipTo(_eol)
    _cPoundCloseXmlComment = pp.Suppress('///</summary>') + pp.SkipTo(_eol)
    _xmlCommentTxt = ~_cPoundCloseXmlComment + pp.SkipTo(_eol)
    xmlComment = _cPoundOpenXmlComment + pp.OneOrMore(_xmlCommentTxt) + _cPoundCloseXmlComment

    match = xmlComment.scanString(testStr)

and for output:

    for item, start, stop in match:
        for entry in item:
            print(entry)

But I have not had much success getting a grammar to work across multiple lines.

(Note: I tested the sample above in Python 3.2. It runs, but it does not print any values, which is my problem.)

Thanks!

+4

python grammar pyparsing xml-comments

mike Oct 19 '11 at 16:54

3 answers

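I think Literal('\n') is your problem. You do not want to build a Literal out of whitespace characters, because pyparsing expressions skip leading whitespace by default before trying to match, so the newline is already gone by the time the Literal looks for it. Use LineEnd() instead.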

EDIT 1: Just because you get an infinite loop with LineEnd() doesn't mean Literal('\n') is better. Try adding .setDebug() to the end of your _eol definition, and you will see that it never matches anything.
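
To see the difference without reading debug output, here is a minimal sketch that tries both expressions against a bare newline. The helper name can_match is made up for this illustration and is not part of pyparsing; the camelCase method names match the ones used elsewhere in this thread.

```python
import pyparsing as pp


def can_match(expr, text):
    """Hypothetical helper: True if expr parses the start of text."""
    try:
        expr.parseString(text)
        return True
    except pp.ParseException:
        return False


# Literal("\n") fails: the newline is skipped as leading whitespace
# before the Literal gets a chance to look at it.
literal_ok = can_match(pp.Literal("\n"), "\n")

# LineEnd() does not treat "\n" as skippable whitespace, so it matches.
line_end_ok = can_match(pp.LineEnd(), "\n")

print(literal_ok, line_end_ok)  # -> False True
```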

Instead of trying to define the body of your comment as "one or more lines that are not the closing line, each read up to the end of the line", what if you just skip straight to the closing tag:

 xmlComment = _cPoundOpenXmlComment + pp.SkipTo(_cPoundCloseXmlComment) + _cPoundCloseXmlComment

(The reason you were getting an infinite loop with LineEnd() is that you essentially wrote OneOrMore(SkipTo(LineEnd())) but never consumed the LineEnd(). The OneOrMore just kept matching over and over, each time returning an empty string, because the parsing position was already at the end of the line.)
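
Putting that suggestion together with the test string from the question, a self-contained sketch might look like the following. The names open_tag, close_tag and summary_lines are invented for this example, and the tag definitions are simplified to plain Suppress expressions rather than the asker's originals. The cleanup of the leading /// is ordinary string handling, not a pyparsing feature.

```python
import pyparsing as pp

test_str = """
///<summary>
/// foobar
///</summary>
"""

# Simplified tag definitions (assumption: nothing else follows the tags).
open_tag = pp.Suppress("///<summary>")
close_tag = pp.Suppress("///</summary>")

# Skip everything between the tags in one step instead of line by line.
xml_comment = open_tag + pp.SkipTo(close_tag) + close_tag


def summary_lines(source):
    """Return the comment body lines with the leading '///' removed."""
    found = []
    for tokens, _start, _end in xml_comment.scanString(source):
        # tokens[0] is the raw text that SkipTo collected between the tags.
        for raw_line in tokens[0].strip().splitlines():
            found.append(raw_line.strip().lstrip("/").strip())
    return found


print(summary_lines(test_str))  # -> ['foobar']
```

Because the body is no longer defined in terms of line ends, the same grammar should also cope with a comment written on a single line.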

+2

Paulmcg Oct 19 '11 at 19:02

You could use an XML parser to parse the XML. It should be easy to extract the relevant comment lines first:

    import re
    from xml.etree import ElementTree as etree

    # `text` holds the C# source; extract the content of every /// line
    lines = re.findall(r'^\s*///(.*)', text, re.MULTILINE)

    # wrap the pieces in a single root element and parse them as XML
    root = etree.fromstring('<root>%s</root>' % ''.join(lines))
    print(root.findtext('summary'))  # -> foobar

+1

jfs Oct 19 '11 at 10:51

unutbu · Accepted Answer · 2011-10-19T19:51:42+0000

How about using nestedExpr:

    import pyparsing as pp

    text = '''\
    ///<summary>
    /// foobar
    ///</summary>
    blah blah
    ///<summary>
    /// bar
    ///</summary>
    ///<summary>
    ///<summary>
    /// baz
    ///</summary>
    ///</summary>
    '''

    comment = pp.nestedExpr("///<summary>", "///</summary>")
    for match in comment.searchString(text):
        print(match)

    # [['///', 'foobar']]
    # [['///', 'bar']]
    # [[['///', 'baz']]]
