Python script speed improvement

I have an input file with a list of strings.

I iterate over every fourth line, starting from line 2.

From each of these lines I build a new string from the first and last 6 characters and write it to the output file only if this new string is unique.

The code I wrote for this works, but I work with very large deep sequencing files, and it has been running all day without making much progress. I am therefore looking for any suggestions to make it much faster, if possible. Thanks.

    def method():
        target = open(output_file, 'w')
        with open(input_file, 'r') as f:
            lineCharsList = []
            for line in f:
                #Make string from first and last 6 characters of a line
                lineChars = line[0:6]+line[145:151]
                if not (lineChars in lineCharsList):
                    lineCharsList.append(lineChars)
                    target.write(lineChars + '\n') #If string is unique, write to output file
                for skip in range(3): #Used to step through four lines at a time
                    try:
                        check = line #Check for additional lines in file
                        next(f)
                    except StopIteration:
                        break
        target.close()
4 answers

Try defining lineCharsList as a set instead of a list:

    lineCharsList = set()
    ...
    lineCharsList.add(lineChars)

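This will improve the performance of the in operator: a membership test on a set is a hash lookup that takes roughly constant time, whereas on a list it scans every element. Also, if memory is not a concern, you may want to collect all the output in a list and write everything at the end, instead of making many small write() calls.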
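A minimal sketch of both suggestions together, assuming the same slice positions as the question (first 6 characters plus characters 145 to 150). The names unique_end_keys and write_unique_keys and their parameters are made up for illustration and are not part of the original code:

```python
def unique_end_keys(lines, head=6, tail_start=145, tail_end=151):
    """Yield each first-seen key built from the ends of every fourth line."""
    seen = set()  # membership test is a hash lookup, not a scan of a list
    for index, line in enumerate(lines):
        if index % 4:  # keep lines 0, 4, 8, ... like the question's loop
            continue
        key = line[:head] + line[tail_start:tail_end]
        if key not in seen:
            seen.add(key)
            yield key


def write_unique_keys(input_path, output_path):
    """Collect all unique keys in memory, then write them in one go."""
    with open(input_path) as src:
        keys = list(unique_end_keys(src))
    with open(output_path, "w") as dst:
        dst.writelines(key + "\n" for key in keys)
```

Collecting every key before writing only pays off if the unique keys fit comfortably in memory; the set already holds them all, so the extra list roughly doubles that footprint.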


You can use itertools.islice (https://docs.python.org/2/library/itertools.html#itertools.islice):

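    import itertools

    def method():
        with open(input_file, 'r') as inf, open(output_file, 'w') as ouf:
            seen = set()
            # Step through the file four lines at a time
            for line in itertools.islice(inf, None, None, 4):
                line = line.rstrip('\n')  # otherwise line[-6:] would include the newline
                s = line[:6] + line[-6:]
                if s not in seen:
                    seen.add(s)
                    ouf.write("{}\n".format(s))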

Besides using a set as Oscar proposed, you can also use islice to skip lines, rather than using a for loop.

As stated in this post, islice steps through the iterator in C, so it should be much faster than a plain Python for loop.
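As a small illustration of the skipping, islice also takes a start offset, which is one way to begin at line 2 as the question describes. The helper name every_fourth and the sample records are made up for this sketch, and the four-line record layout is an assumption about the input:

```python
import itertools


def every_fourth(lines, first_line=2):
    """Return an iterator over every fourth line; first_line counts from 1."""
    return itertools.islice(lines, first_line - 1, None, 4)


# Made-up four-line records; the second line of each holds the sequence.
sample = ["@id1", "ACGT", "+", "!!!!", "@id2", "TTGA", "+", "####"]
print(list(every_fourth(sample)))  # ['ACGT', 'TTGA']
```

The same call works on an open file object, since islice accepts any iterable.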


Try replacing

lineChars = line[0:6]+line[145:151]

with

lineChars = ''.join([line[0:6], line[145:151]])

as it may be more efficient, depending on the circumstances.
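Whether join actually helps for just two short slices is easy to measure rather than guess. Here is a rough timing sketch using the standard timeit module; the sample line is made up, and the numbers will vary by machine and Python version:

```python
import timeit

# Made-up 151-character line plus newline, matching the question's slice positions.
line = "ACGTAC" + "N" * 139 + "TTGACA\n"


def with_plus():
    return line[0:6] + line[145:151]


def with_join():
    return ''.join([line[0:6], line[145:151]])


# Time each variant over the same number of calls and print the totals.
for func in (with_plus, with_join):
    seconds = timeit.timeit(func, number=200000)
    print(func.__name__, round(seconds, 4))
```

Both functions return the same string, so any difference in the printed times comes from the concatenation method alone.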
