Python script speed improvement

Question


I have an input file with a list of strings.

I iterate over every fourth line, starting from line 2.

From each of these lines I build a new string from its first and last 6 characters, and write it to the output file only if that string has not been seen before.

The code I wrote for this works, but I work with very large deep sequencing files; it has been running all day and has not made much progress. I am therefore looking for any suggestions to make it much faster, if possible. Thanks.

def method():
    target = open(output_file, 'w')
    with open(input_file, 'r') as f:
        lineCharsList = []
        for line in f:
            #Make string from first and last 6 characters of a line
            lineChars = line[0:6]+line[145:151]
            if not (lineChars in lineCharsList):
                lineCharsList.append(lineChars)
                target.write(lineChars + '\n') #If string is unique, write to output file
            for skip in range(3): #Used to step through four lines at a time
                try:
                    check = line #Check for additional lines in file
                    next(f)
                except StopIteration:
                    break
    target.close()

+5

python

The nightman Jul 9 '15 at 2:10


4 answers

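You can use itertools.islice (https://docs.python.org/2/library/itertools.html#itertools.islice) to step through the file four lines at a time, together with a set to track the strings already seen: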

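import itertools

def method():
    with open(input_file, 'r') as inf, open(output_file, 'w') as ouf:
        seen = set()
        # islice(inf, None, None, 4) yields every fourth line, starting with the first
        for line in itertools.islice(inf, None, None, 4):
            line = line.rstrip('\n')  # drop the newline so line[-6:] is the last 6 characters
            s = line[:6]+line[-6:]
            if s not in seen:
                seen.add(s)
                ouf.write("{}\n".format(s))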

+5

Ding Jul 9 '15 at 2:32


Besides using a set, as Óscar proposed, you can also use islice to skip lines instead of an inner for loop.

As stated in this post, islice advances the iterator in C, so it should be much faster than a plain Python for loop.
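To illustrate how islice selects lines, here is a minimal sketch that uses an in-memory list in place of a file; the sample data is made up for the example. The start argument is zero-based, so passing 1 begins at line 2, and a step of 4 then takes every fourth line.

```python
import itertools

# Hypothetical stand-in for a file: twelve numbered lines.
lines = ["line {}\n".format(n) for n in range(1, 13)]

# start=1 is zero-based, so iteration begins at line 2; step=4 takes every fourth line.
picked = [line.rstrip("\n") for line in itertools.islice(lines, 1, None, 4)]
print(picked)
```

The same call works on an open file object, since islice accepts any iterable.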

+2

lightalchemist Jul 9 '15 at 2:35


Try replacing

lineChars = line[0:6]+line[145:151]

with

lineChars = ''.join([line[0:6], line[145:151]])

as it may be more efficient, depending on the circumstances.
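Whether join actually helps is best measured rather than assumed. The following is a rough sketch using timeit; the sample line and the helper names with_plus and with_join are made up for the example, and the timings will vary by machine and Python version.

```python
import timeit

line = "ACGT" * 40  # hypothetical 160-character read

def with_plus(s):
    # Concatenate the two slices with the + operator.
    return s[0:6] + s[145:151]

def with_join(s):
    # Build the same string with str.join on a two-element list.
    return ''.join([s[0:6], s[145:151]])

# Both variants must produce the same string.
assert with_plus(line) == with_join(line)

t_plus = timeit.timeit(lambda: with_plus(line), number=100000)
t_join = timeit.timeit(lambda: with_join(line), number=100000)
print("plus: {:.4f}s  join: {:.4f}s".format(t_plus, t_join))
```

With only two short pieces, any difference is likely to be small compared with the cost of the membership test.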

+1

Doug Jul 9 '15 at 3:12


Óscar López · Accepted Answer · 2015-07-09T02:15:50+0000

Try defining lineCharsList as a set instead of a list:

 lineCharsList = set()
 ...
 lineCharsList.add(lineChars)

This makes the in operator much faster: a membership test on a set takes roughly constant time on average, whereas on a list it scans every element. Also, if memory is not a concern, you may want to collect all the output in a list and write everything at the end, instead of making many small write() calls.
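Both suggestions can be combined in a short sketch. The helper name unique_keys and the explicit input_file and output_file parameters are assumptions made for the example, and itertools.islice is used here only as one convenient way to take every fourth line; the slice positions are the ones from the question.

```python
import itertools

def unique_keys(lines, start=0, step=4):
    """Return the unique first-6 + last-6 strings, in first-seen order."""
    seen = set()   # constant-time membership tests on average
    out = []       # output buffered in memory, written once at the end
    for line in itertools.islice(lines, start, None, step):
        key = line[0:6] + line[145:151]
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out

def method(input_file, output_file):
    with open(input_file, 'r') as f:
        keys = unique_keys(f)
    with open(output_file, 'w') as target:
        # A single write call instead of one per unique string.
        target.write(''.join(k + '\n' for k in keys))
```

Buffering everything only pays off if the unique strings fit comfortably in memory; otherwise writing as you go, as in the original code, keeps memory use bounded by the set alone.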
