Split text into paragraphs where paragraph separators are non-standard

If I have text with standard paragraph formatting (an empty line followed by indentation), like text 1, it is easy enough to extract the paragraphs with text.split("\n\n").

Text 1:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus sit amet sapien velit, ac sodales ante. Integer mattis eros non turpis interdum et auctor enim consectetur, etc. Praesent molestie suscipit bibendum. Donec justo purus, venenatis eget convallis sed, feugiat vitae velit,etc. 

But what if the text has non-standard paragraph formatting, like text 2? There are no empty lines, and the amount of leading whitespace varies.

Text 2:

  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus sit amet sapien velit, ac sodales ante. Integer mattis eros non turpis interdum et auctor enim consectetur, etc. Praesent molestie suscipit bibendum. Donec justo purus, venenatis eget convallis sed, feugiat vitae velit,etc. 

Since leading whitespace is common to both the standard and the non-standard format, I was thinking of using a regular expression that matches the leading spaces to find the paragraph breaks, but there should be a more elegant way to do this.
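For reference, the leading-whitespace idea could be sketched roughly as follows. This is only an illustration: the sample string and the helper name split_on_indent are made up here, and it assumes that every paragraph starts with at least one space or tab while continuation lines do not.

```python
import re

# Hypothetical sample: no blank lines, paragraphs marked only by leading whitespace.
sample = (
    "  First paragraph, line one.\n"
    "still the first paragraph.\n"
    "    Second paragraph, deeper indent.\n"
    "still the second paragraph.\n"
)

def split_on_indent(text):
    """Split wherever a newline is followed by a space or tab (hypothetical helper)."""
    # The lookahead keeps the indent out of the separator; it is removed below anyway.
    parts = re.split(r"\n(?=[ \t])", text)
    # Collapse inner line breaks and indentation into single spaces, drop empty pieces.
    return [" ".join(p.split()) for p in parts if p.strip()]

paragraphs = split_on_indent(sample)
```

Because the split point is "newline followed by indentation", the same call also handles the standard format, where the indented line comes after an empty line.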

1 answer

The regex solution you suggest seems pretty elegant:

 re.split(r'\s{4,}', text)

This uses a run of four or more consecutive whitespace characters as the paragraph separator. You can use r'\n\s{3,}' or something similar if that works better.
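To see how the two patterns behave, here is a small sketch on a made-up sample (the text variable and its contents are assumptions, not the asker's real data). It shows one detail worth knowing: if the text itself begins with indentation, the first pattern yields an empty leading piece that has to be filtered out.

```python
import re

# Hypothetical sample text; the real input may differ.
text = (
    "    Lorem ipsum dolor sit amet,\n"
    "consectetur adipiscing elit.\n"
    "      Praesent molestie suscipit bibendum.\n"
    "Donec justo purus.\n"
)

# First pattern: any run of 4+ whitespace characters is a separator.
# The indent at the very start also matches, so empty pieces are filtered out.
loose = [p.strip() for p in re.split(r"\s{4,}", text) if p.strip()]

# Stricter pattern: a newline followed by 3+ more whitespace characters.
# strip() first removes the first paragraph's indent and the trailing newline.
strict = re.split(r"\n\s{3,}", text.strip())
```

On this sample both lists contain the same two paragraphs. The stricter pattern is probably the safer choice if a line might contain four or more spaces in the middle, since the first pattern would split there as well.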
