Pyparsing capturing custom text groups with given headers as nested lists

I have a text file similar to this:

section header 1:
some words can be anything
more words could be anything at all
etc etc lala

some other header:
as before could be anything
hey isnt this fun

I am trying to construct a grammar with pyparsing that yields the following list structure when I query the parsed results as a list (i.e., iterating over the elements of parsed.asList() should print the following):

['section header 1:', [['some words can be anything'], ['more words could be anything at all'], ['etc etc lala']]]
['some other header:', [['as before could be anything'], ['hey isnt this fun']]]

Header names are known in advance, and individual headers may or may not appear. If a header appears, it is always followed by at least one line of content.

The problem I am facing is that I cannot get the parser to recognise where 'section header 1:' ends and 'some other header:' begins. I end up with parsed.asList() looking like this:

['section header 1:', [['some words can be anything'], ['more words could be anything at all'], ['etc etc lala'], ['some other header'], ['as before could be anything'], ['hey isnt this fun']]]

(I.e., 'section header 1:' is picked up correctly, but everything after it is appended to that section, including the subsequent header lines, etc.)

I have tried different things, playing with leaveWhitespace() and LineEnd() in various ways, but I can't figure it out.

The basic parser I am working with is this (a contrived example; in reality this sits inside a class definition, etc.):

    from pyparsing import Group, Literal, OneOrMore, Word, ZeroOrMore, printables

    header_1_line = Literal('section header 1:')
    # any run of printable words counts as a text line
    text_line = Group(OneOrMore(Word(printables)))
    header_1_block = Group(header_1_line + Group(OneOrMore(text_line)))

    header_2_line = Literal('some other header:')
    header_2_block = Group(header_2_line + Group(OneOrMore(text_line)))

    overall_structure = ZeroOrMore(header_1_block | header_2_block)

and called with

 parsed=overall_structure.parseFile() 

Regards, Matt.

1 answer

Matt -

Welcome to pyparsing! You have fallen into one of the most common traps in working with pyparsing, which is that people are smarter than computers. When you look at the input text, you can easily see which text can be a header and which cannot. Unfortunately, pyparsing is not so intuitive, so you have to tell it explicitly what can and cannot be text.

When you look at your sample text, you do not accept just any line of text as possible text within a section. How do you know that 'some other header:' is invalid as text? Because you know that this line matches one of the known header strings. But in your current code, you have told pyparsing that any collection of Word(printables) is valid text, even if that collection is a valid section header.
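To see this concretely, here is a small sketch that feeds a header line to an expression shaped like the text_line from the question. It only assumes that pyparsing is installed; the names are taken from the question's code.

```python
from pyparsing import Group, OneOrMore, Word, printables

# Same shape as text_line in the question: one or more runs of printable characters.
text_line = Group(OneOrMore(Word(printables)))

# As far as this expression is concerned, a header line is just more printable words.
result = text_line.parseString("some other header:").asList()
print(result)  # [['some', 'other', 'header:']]
```

Nothing in text_line says "but not a header", so the parser happily swallows the next header as content.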

To fix this, you must add explicit lookahead to your parser. Pyparsing offers two lookahead constructs: NotAny and FollowedBy. NotAny can be abbreviated using the '~' operator, so we can write this pseudo-code expression for text:

 text = ~any_section_header + everything_up_to_the_end_of_the_line 
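A minimal runnable version of that pseudo-code might look like the following. The helper is_text is a hypothetical name used only for this illustration, and the sketch checks one line at a time rather than a whole file.

```python
from pyparsing import Literal, ParseException, restOfLine

header_1 = Literal("section header 1:")
header_2 = Literal("some other header:")
any_header = header_1 | header_2

# ~any_header builds a NotAny: it succeeds only if no header starts here,
# and it consumes no input itself. restOfLine then takes the line's text.
text = ~any_header + restOfLine

def is_text(line):
    """Hypothetical helper: True if the line parses as text, False if it is a header."""
    try:
        text.parseString(line)
        return True
    except ParseException:
        return False

print(text.parseString("some words can be anything").asList())
print(is_text("some other header:"))
```

The first call prints the line as a single token, and the second prints False because the negative lookahead rejects the header before restOfLine gets a chance to match it.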

Here is a complete parser that uses negative lookahead to make sure each section is read in full, breaking on the section headers:

    from pprint import pprint
    from pyparsing import (ParserElement, LineEnd, Literal, restOfLine,
                           ZeroOrMore, Group, StringEnd)

    test = """
    section header 1:
    some words can be anything
    more words could be anything at all
    etc etc lala

    some other header:
    as before could be anything
    hey isnt this fun
    """

    # skip only spaces and tabs as whitespace, so that newlines stay significant
    ParserElement.setDefaultWhitespaceChars(" \t")
    NL = LineEnd().suppress()
    END = StringEnd()

    header_1 = Literal('section header 1:')
    header_2 = Literal('some other header:')
    any_header = (header_1 | header_2)

    # text isn't just anything! don't accept header line, and stop at the end of the input string
    text = Group(~any_header + ~END + restOfLine)

    overall_structure = ZeroOrMore(Group(any_header + Group(ZeroOrMore(text))))
    # newlines are no longer whitespace, so tell the parser to skip over them
    overall_structure.ignore(NL)

    pprint(overall_structure.parseString(test).asList())

In my first attempt, I forgot to also look for the end of the input string, so my restOfLine expression looped forever. After adding the second lookahead for the string end, my program terminates successfully. An exercise left for you: instead of enumerating all possible headers, define a header line as any line that ends with a ':' character.
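One possible way to approach that exercise, offered as a sketch rather than the definitive answer: describe the header with a Regex that matches a line whose last non-blank character is ':'. The regular expression and the variable names sample and result are assumptions of this sketch, and it relies on pyparsing's default whitespace skipping (newlines included) instead of the NL handling used above.

```python
import re
from pyparsing import Group, Regex, StringEnd, ZeroOrMore, restOfLine

sample = """
section header 1:
some words can be anything
more words could be anything at all
etc etc lala

some other header:
as before could be anything
hey isnt this fun
"""

END = StringEnd()

# Assumed definition: a header is everything up to a ':' that is followed
# only by optional blanks and the end of the line (re.MULTILINE makes $ match there).
any_header = Regex(r"[^\n]*:(?=[ \t]*$)", flags=re.MULTILINE)

# Same structure as before: text is a line that is neither a header nor the end of input.
text = Group(~any_header + ~END + restOfLine)
overall_structure = ZeroOrMore(Group(any_header + Group(ZeroOrMore(text))))

result = overall_structure.parseString(sample).asList()
for section in result:
    print(section)
```

With this definition, a content line that merely contains a colon in the middle is still treated as text, since only a trailing ':' marks a header.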

Good luck with your efforts, Paul
