The most efficient way to remove needless newlines in Python

Question

I'm looking to learn how to use Python to get rid of needless newlines in text such as what you get from Project Gutenberg, where the plain-text files are hard-wrapped with a newline every 70 characters or so. In Tcl, I could do a simple string map, for example:

 set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]

This keeps paragraphs that are separated by two newlines (or by a newline and a tab) apart, joins lines that end with a single newline (replacing the newline with a space), and drops redundant CRs. Since Python has no string map, I have not yet found the most efficient way to discard all the needless newlines, although I'm fairly sure it is not to search for each newline in turn and replace it with a space. I could just evaluate the Tcl expression from Python if all else fails, but I would like to find the best Pythonic way to do the same thing. Can any Python expert help me?

+8

python

Mcclamrock Mar 26 '16 at 23:21

3 answers

You can use a regex with a lookahead:

    import re

    text = """
    ...
    """

    newtext = re.sub(r"\n(?=[^\n\t])", " ", text)

This replaces with a space any newline that is followed by a character other than a newline or a tab.
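To see what the lookahead does, here is a small check on a made-up sample (the `sample` string is my own assumption, not text from the question). One thing it seems to show: the second newline of a blank-line break is followed by an ordinary character, so it is replaced too, and the paragraph break shrinks to one newline plus a space. A negative look-behind such as `(?<!\n)` would guard against that.

```python
import re

# Hypothetical sample: a hard-wrapped paragraph, a blank-line paragraph
# break, and a tab-indented paragraph.
sample = "one\ntwo\n\nthree\n\tfour\n"

# Lookahead only: replace a newline when the next character exists and is
# neither a newline nor a tab.
lookahead_only = re.sub(r"\n(?=[^\n\t])", " ", sample)

print(repr(lookahead_only))  # 'one two\n three\n\tfour\n'
```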

+2

zondo Mar 26 '16 at 23:28

I use the following script when I want to do this:

    import sys
    import os

    # The result is written next to the input file as <name>_unwrapped<ext>.
    filename, extension = os.path.splitext(sys.argv[1])
    with open(filename + extension, encoding='utf-8-sig') as file, \
            open(filename + "_unwrapped" + extension, 'w', encoding='utf-8-sig') as output:
        *lines, last = list(file)
        for line in lines:
            if line == "\n":
                # Blank line: paragraph break.
                line = "\n\n"
            elif line[0] == "\t":
                # Tab-indented line: starts a new paragraph.
                line = "\n" + line[:-1] + " "
            else:
                # Ordinary line: join it to the next one.
                line = line[:-1] + " "
            output.write(line)
        output.write(last)

An "empty" line, containing only a line feed, turns into two line feeds (to replace the one removed from the previous line). This handles files that separate paragraphs with two line feeds.
A line that starts with a tab gets a leading line feed (to replace the one removed from the previous line) and has its trailing line feed replaced with a space. This handles files that separate paragraphs with a tab character.
A line that is neither empty nor starts with a tab has its trailing line feed replaced with a space.
The last line of the file may not have a trailing line feed, so it is copied over as-is.
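The rules above can also be sketched as a function that works on a list of lines in memory, which makes them easier to try out without touching files. The name `unwrap_lines` and the sample input are hypothetical, and the guard for empty input is an addition of mine.

```python
def unwrap_lines(lines):
    """Apply the per-line rules to a list of lines (hypothetical helper)."""
    if not lines:
        # Assumption: empty input simply yields an empty string.
        return ""
    *body, last = lines
    out = []
    for line in body:
        if line == "\n":
            # Blank line: becomes a two-newline paragraph break.
            out.append("\n\n")
        elif line.startswith("\t"):
            # Tab-indented line: restore the newline removed before it.
            out.append("\n" + line[:-1] + " ")
        else:
            # Ordinary line: swap the trailing newline for a space.
            out.append(line[:-1] + " ")
    # The last line may lack a trailing newline, so copy it unchanged.
    out.append(last)
    return "".join(out)


sample_lines = ["one\n", "two\n", "\n", "three\n", "\tfour\n", "five"]
print(repr(unwrap_lines(sample_lines)))  # 'one two \n\nthree \n\tfour five'
```

Note that a joined line keeps a trailing space in front of each paragraph break.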

+2

TigerhawkT3 Mar 26 '16 at 23:51

ekhumoro · Accepted Answer · 2016-03-27T00:32:50+0000

The closest equivalent to Tcl's string map would be str.translate, but unfortunately it can only map single characters. So to get a similarly compact example, you need to use a regular expression. This can be done with look-behind/look-ahead assertions, but the \r characters have to be removed first:

    import re

    oldtext = """\
    This would keep paragraphs separated.
    This would keep paragraphs separated.

    This would keep paragraphs separated.
    \tThis would keep paragraphs separated.

    \rWhen, in the course of human events, it becomes necessary
    \rfor one people
    """

    newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))

Output:

    This would keep paragraphs separated. This would keep paragraphs separated.

    This would keep paragraphs separated.
    	This would keep paragraphs separated.

    When, in the course of human events, it becomes necessary for one people

I doubt this is as efficient as the Tcl code, though.
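If you want something closer in spirit to string map, one option is a single regex alternation built from the key/replacement pairs. This is only a rough sketch: `string_map` is a hypothetical helper, not a standard-library function, and it assumes every key is a non-empty string. At each position the alternation tries the keys in list order, and scanning resumes after the match, so the text is walked once.

```python
import re


def string_map(mapping, text):
    """Rough emulation of Tcl's `string map` (hypothetical helper).

    `mapping` is a list of (key, replacement) pairs; keys are assumed
    to be non-empty. Earlier pairs take precedence at each position.
    """
    pattern = re.compile("|".join(re.escape(key) for key, _ in mapping))
    table = dict(mapping)
    return pattern.sub(lambda match: table[match.group(0)], text)


# The same pairs as in the Tcl call from the question.
pairs = [("\r", ""), ("\n\n", "\n\n"), ("\n\t", "\n\t"), ("\n", " ")]
result = string_map(pairs, "one\r\ntwo\n\nthree\n\tfour\n")
print(repr(result))  # 'one two\n\nthree\n\tfour '
```

I have not measured whether this is faster or slower than the look-behind/look-ahead version.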

UPDATE

I did a little test using the Project Gutenberg EBook of War and Peace (Plain Text UTF-8, 3.1 MB). Here is my Tcl script:

    set fp [open "gutenberg.txt" r]
    set oldtext [read $fp]
    close $fp
    set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
    puts $newtext

and my Python equivalent:

    import re

    with open('gutenberg.txt') as stream:
        oldtext = stream.read()

    newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))

    print(newtext)

A crude performance test:

    $ /usr/bin/time -f '%E' tclsh gutenberg.tcl > output1.txt
    0:00.18
    $ /usr/bin/time -f '%E' python gutenberg.py > output2.txt
    0:00.30

So, as expected, the Tcl version is more efficient. However, the output of the Python version looks a bit cleaner (no extra spaces are added at the beginning of lines).
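The wall-clock figures above include interpreter start-up and file I/O. To time only the substitution inside Python, something like the following could be used. It is a sketch under assumptions: the synthetic `sample` text stands in for a real book, `unwrap` is just a name I gave the one-liner, and the numbers will vary by machine.

```python
import re
import timeit

# Same pattern as in the answer, compiled once up front.
PATTERN = re.compile(r'(?<!\n)\n(?![\n\t])')


def unwrap(text):
    """Drop CRs, then join lines that end with a single newline."""
    return PATTERN.sub(' ', text.replace('\r', ''))


# Synthetic stand-in for a hard-wrapped book: 500 paragraphs of 8 lines,
# separated by blank lines (an assumption, not real Gutenberg text).
paragraph = "\n".join(["word " * 12] * 8)
sample = "\n\n".join([paragraph] * 500)

runs = 5
seconds = timeit.timeit(lambda: unwrap(sample), number=runs) / runs
print(f"{len(sample)} characters unwrapped in {seconds:.4f} s per run")
```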
