Fastest way to remove first and last lines from Python string

I have a Python script that, for various reasons, has a variable holding a fairly large string, say 10 MB long. This string contains multiple lines.

What is the fastest way to remove the first and last lines of this string? Because of the size of the string, the faster the operation, the better; the emphasis is on speed. The program should return a slightly smaller string, without the first and last lines.

'\n'.join(string_variable[-1].split('\n')[1:-1]) is the easiest way to do this, but it is very slow because split() copies the data in memory and join() copies it again.

Example string:

    *** START OF DATA ***
    data
    data
    data
    *** END OF DATA ***

Extra credit: have the program not choke if there is no data between the two marker lines. This is optional, since in my case there should never be a string with no data in between.

+7
performance python string
4 answers

First split on '\n' once, then check whether the string at the last index contains '\n'. If it does, call str.rsplit on '\n' once and take the item at index 0; otherwise return an empty string:

    >>> def solve(s):
    ...     s = s.split('\n', 1)[-1]
    ...     if s.find('\n') == -1:
    ...         return ''
    ...     return s.rsplit('\n', 1)[0]
    ...
    >>> s = '''*** START OF DATA ***
    ... data
    ... data
    ... data
    ... *** END OF DATA ***'''
    >>> solve(s)
    'data\ndata\ndata'
    >>> s = '''*** START OF DATA ***
    ... *** END OF DATA ***'''
    >>> solve(s)
    ''
    >>> s = '\n'.join(['a'*100]*10**5)
    >>> %timeit solve(s)
    100 loops, best of 3: 4.49 ms per loop

Or don't split at all: find the index of '\n' from either end and slice the string:

    >>> def solve_fast(s):
    ...     ind1 = s.find('\n')
    ...     ind2 = s.rfind('\n')
    ...     return s[ind1+1:ind2]
    ...
    >>> s = '''*** START OF DATA ***
    ... data
    ... data
    ... data
    ... *** END OF DATA ***'''
    >>> solve_fast(s)
    'data\ndata\ndata'
    >>> s = '''*** START OF DATA ***
    ... *** END OF DATA ***'''
    >>> solve_fast(s)
    ''
    >>> s = '\n'.join(['a'*100]*10**5)
    >>> %timeit solve_fast(s)
    100 loops, best of 3: 2.65 ms per loop
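
One caveat with the slicing approach: str.find and str.rfind return -1 when the string contains no '\n' at all, so for a single-line input the slice becomes s[0:-1] and returns everything except the last character. Below is a minimal sketch of a guarded variant. The name strip_first_last and the decision to return an empty string for single-line input are assumptions for illustration, not part of the answer above.

```python
def strip_first_last(s):
    """Return s without its first and last lines (hypothetical helper)."""
    first = s.find('\n')
    if first == -1:
        # No line break at all: the only line is both first and last.
        return ''
    last = s.rfind('\n')
    # If first == last there are exactly two lines and the slice is empty.
    return s[first + 1:last]
```

The extra check costs one comparison, so it should not change the timing in any meaningful way.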
+9

Consider a string s that looks something like this:

 s = "line1\nline2\nline3\nline4\nline5" 

The following code ...

 s[s.find('\n')+1:s.rfind('\n')] 

... displays the result:

 'line2\nline3\nline4' 

So this is probably the shortest code that removes the first and last lines of a string. I don't think the .find and .rfind methods do anything other than search for the given substring. Try timing it!
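
One way to try the speed is the standard timeit module. The sketch below compares the slice against the split/join approach from the question on roughly 10 MB of made-up test data; the function names slice_inner and split_join_inner are hypothetical, and the numbers will vary by machine and Python version.

```python
import timeit

def slice_inner(s):
    # Slice between the first and the last line break.
    return s[s.find('\n') + 1:s.rfind('\n')]

def split_join_inner(s):
    # Split into a list of lines, drop both ends, join again.
    return '\n'.join(s.split('\n')[1:-1])

# Roughly 10 MB of test data: 100,000 lines of 100 characters each.
s = '\n'.join(['a' * 100] * 10**5)

# Both approaches must agree before their timings are worth comparing.
assert slice_inner(s) == split_join_inner(s)

for fn in (slice_inner, split_join_inner):
    # Best of 3 repeats, 5 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(s), number=5, repeat=3)) / 5
    print(f'{fn.__name__}: {best * 1e3:.2f} ms per call')
```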

+6

Depending on how your use case consumes the string, the fastest way to remove the lines may be not to remove them at all.

If you plan to access the lines in the string sequentially, you can build a generator that skips the first and last lines and yields each line as it is consumed, rather than building a new set of copies of all the lines up front.

An ad-hoc way to skip the first and last lines is to iterate over the string without making unnecessary copies, keeping track of three consecutive lines and returning only the second one. This way the iteration finishes before reaching the last line, without needing to know the position of the last line break in advance.

The following function should give you the desired result:

    def split_generator(s):
        # Keep track of start/end positions for three lines
        start_prev = end_prev = 0
        start = end = 0
        start_next = end_next = 0
        nr_lines = 0

        for idx, c in enumerate(s):
            if c == '\n':
                nr_lines += 1
                start_prev = start
                end_prev = end
                start = start_next
                end = end_next
                start_next = end_next
                end_next = idx
                if nr_lines >= 3:
                    yield s[(start + 1) : end]

        # Handle the case when input string does not finish on "\n"
        # (endswith also copes with an empty string, where s[-1] would raise)
        if not s.endswith('\n') and nr_lines >= 2:
            yield s[(start_next+1):end_next]

You can test it with:

    print("1st example")
    for filtered_strs in split_generator('first\nsecond\nthird'):
        print(filtered_strs)

    print("2nd example")
    for filtered_strs in split_generator('first\nsecond\nthird\n'):
        print(filtered_strs)

    print("3rd example")
    for filtered_strs in split_generator('first\nsecond\nthird\nfourth'):
        print(filtered_strs)

    print("4th example")
    for filtered_strs in split_generator('first\nsecond\nthird\nfourth\n'):
        print(filtered_strs)

    print("5th example")
    for filtered_strs in split_generator('first\nsecond\nthird\nfourth\nfifth'):
        print(filtered_strs)

This generates the output:

    1st example
    second
    2nd example
    second
    3rd example
    second
    third
    4th example
    second
    third
    5th example
    second
    third
    fourth

Note that the biggest advantage of this approach is that only one new line is created at a time, and it takes practically no time to produce the first line of output (instead of waiting until all lines have been found before continuing). Again, this may or may not be useful depending on your use case.
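
The loop in split_generator inspects every character in Python code. A variant of the same idea can let str.find locate the line breaks instead, which keeps the scanning inside the built-in string search. This is only a rough sketch under the same conventions as the examples above (a single trailing '\n' does not count as an extra empty last line); the name iter_inner_lines is hypothetical.

```python
def iter_inner_lines(s):
    """Yield the lines of s except the first and the last (hypothetical helper)."""
    start = s.find('\n') + 1  # index where the second line begins
    if start == 0:
        return  # no line break: nothing lies between first and last line
    # Ignore a single trailing newline so it is not treated as a line break
    # that precedes an empty last line.
    end_limit = len(s) - 1 if s.endswith('\n') else len(s)
    while True:
        end = s.find('\n', start, end_limit)
        if end == -1:
            return  # what remains is the last line, so skip it
        yield s[start:end]
        start = end + 1
```

For example, list(iter_inner_lines('first\nsecond\nthird\nfourth')) gives ['second', 'third'], matching the 3rd example above.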

0

Another method is to split the data at the newlines and then rejoin everything except the first and last lines:

    >>> s = '*** START OF DATA *** \n\
    ... data\n\
    ... data\n\
    ... data\n\
    ... *** END OF DATA ***'
    >>> '\n'.join(s.split('\n')[1:-1])
    'data\ndata\ndata'

This also works when there is no data:

    >>> s = '*** START OF DATA *** \n\
    ... *** END OF DATA ***'
    >>> '\n'.join(s.split('\n')[1:-1])
    ''
0
