Python beginner - quick way to find and replace in a large file?

I have a file of about 100 million lines in which I want to replace text with alternative text stored in a tab-delimited file. The code I have works, but it takes about an hour to process the first 70K lines. In an effort to gradually improve my Python skills, I am wondering whether there is a faster way to do this. Thank you. The input file looks something like this:

CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518 CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518

and the file with the replacement values is as follows:

WBGene00045518 21ur-5153

Here is my code:

infile1 = open('f1.txt', 'r')
infile2 = open('f2.txt', 'r')
outfile = open('out.txt', 'w')

import re
from datetime import datetime

startTime = datetime.now()

udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]}
    udict.update(udict1)

mult10K = []
for x in range(100):
    mult10K.append(x * 10000)

linecounter = 0
for line in infile2:
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0:
            print key, value
            line = line.replace(key, value)
            outfile.write(line + '\n')
        else:
            outfile.write(line + '\n')
    linecounter += 1
    if linecounter in mult10K:
        print linecounter
        print (datetime.now()-startTime)

infile1.close()
infile2.close()
outfile.close()
+7
6 answers

You should split your lines into "words" and only look those words up in the dictionary:

>>> re.findall(r"\w+", "CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518 CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518")
['CHROMOSOME_IV', 'ncRNA', 'gene', '5723085', '5723105', 'ID', 'Gene', 'WBGene00045518', 'CHROMOSOME_IV', 'ncRNA', 'ncRNA', '5723085', '5723105', 'Parent', 'Gene', 'WBGene00045518']

This will eliminate the loop over the dictionary that you make for each individual line.

Here is the complete code:

import re

with open("f1.txt", "r") as infile1:
    udict = dict(line.strip().split("\t", 1) for line in infile1)

with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
    for line in infile2:
        for word in re.findall(r"\w+", line):
            if word in udict:
                line = line.replace(word, udict[word])
        outfile.write(line)

Edit: An alternative approach is to build one mega regular expression from your dictionary:

 with open("f1.txt", "r") as infile1: udict = dict(line.strip().split("\t", 1) for line in infile1) regex = re.compile("|".join(map(re.escape, udict))) with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile: for line in infile2: outfile.write(regex.sub(lambda m: udict[m.group()], line)) 
+6

I was thinking about your loop over the dictionary keys and ways to optimize it, and planned to comment on the rest of your code afterwards.

But then I came across this part:

if linecounter in mult10K:
    print linecounter
    print (datetime.now()-startTime)

This innocent-looking piece actually makes Python sequentially scan and compare the elements of your mult10K list for every single line of your file.

Replace this part:

if linecounter % 10000 == 0:
    print linecounter
    print (datetime.now()-startTime)

(and drop the whole mult10K part), and you should see a significant speedup.

Also, it looks like you are writing multiple output lines for each line of input - your main loop looks like this:

linecounter = 0
for line in infile2:
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0:
            print key, value
            line = line.replace(key, value)
            outfile.write(line + '\n')
        else:
            outfile.write(line + '\n')
    linecounter += 1

Replace it with this:

for linecounter, line in enumerate(infile2):
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0:
            print key, value
            line = line.replace(key, value)
    outfile.write(line + '\n')

This correctly writes only one output line for each input line (in addition to eliminating the code duplication and handling the line counting in a "pythonic" way).

+6

This code is full of linear searches. No wonder it runs slowly. Without knowing more about the input, I can't tell you exactly how to fix these problems, but I can at least point them out. I will note the main problems and a couple of minor ones.

udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]}
    udict.update(udict1)

Do not use update here; just add an item to the dictionary:

  udict[linelist[0]] = linelist[1] 

This will be faster than creating a new dictionary for each entry. (And in fact, Sven Marnach's generator-based approach to building this dictionary is even better.) It's pretty minor, though.

mult10K = []
for x in range(100):
    mult10K.append(x * 10000)

This is completely unnecessary. Delete it; I will show you one way to print at intervals without this.

linecounter = 0
for line in infile2:
    for key, value in udict.items():

This is your first big problem. You do a linear search through the dictionary keys, checking each one against the line, for every single line. If the dictionary is very large, this requires a huge number of operations: 100,000,000 * len(udict).

  matches = line.count(key) 

This is another problem. You are looking for matches using a linear search. Then you do replace, which performs the same linear search again! You do not need to check for a match first; replace simply returns the same string if no match is found. This won't make a huge difference either, but it will gain you something.

  line = line.replace(key, value) 

Keep doing these replacements, and only write the line once all the replacements are done:

  outfile.write(line + '\n') 

And finally

linecounter += 1
if linecounter in mult10K:

Forgive me, but this is a funny way to do it! You are doing a linear search through mult10K to decide when to print a progress line, which adds almost 100,000,000 * 100 operations. You could at least search in a set; but the better approach (if you really have to do this at all) is a modulo operation and a test:

if not linecounter % 10000:
    print linecounter
    print (datetime.now()-startTime)

To make this code efficient, you need to get rid of these linear searches. Sven Marnach's answer suggests one way that may work, but it really depends on the data in your file, since the replacement keys might not correspond to obvious word boundaries. (The regex approach he added addresses that, though.)

+5

This is not specific to Python, but you could unroll the double for loop so that the file write does not happen on every iteration of the loop. Perhaps write to the file every 1000 or 10,000 lines.
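A minimal sketch of that batching idea, reusing the names from the question's code; the flush threshold of 10,000 lines is just an illustrative choice:

buffer = []
for line in infile2:
    for key, value in udict.items():
        if key in line:
            line = line.replace(key, value)
    buffer.append(line)
    # Flush the accumulated lines in a single call instead of writing per line.
    if len(buffer) >= 10000:
        outfile.writelines(buffer)
        buffer = []
# Write out whatever is left after the loop finishes.
if buffer:
    outfile.writelines(buffer)

Note that Python's file objects already do some buffering of their own, so how much this gains depends on the platform and buffer sizes.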

+1

I assume that writing one output line for each input line times the number of replacement keys is a bug, and that you really intended to write only one output line per input line.

You need to find a way to check input lines for matches as quickly as possible. Going through the entire dictionary is probably your bottleneck.

I believe regular expressions are compiled into state machines, which can be highly efficient. I have no idea how that holds up when you build one huge expression, but it's worth a try.

freakin_huge_re = re.compile('(' + ')|('.join(udict.keys()) + ')')
for line in infile2:
    matches = [''.join(tup) for tup in freakin_huge_re.findall(line)]
    if matches:
        for key in matches:
            line = line.replace(key, udict[key])
+1

To state the obvious in Python, a list comprehension is a faster (and more readable) way to write this:

mult10K = []
for x in range(100):
    mult10K.append(x * 10000)

like this:

 mult10K = [x*10000 for x in range(100)] 

Similarly, where you have:

udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]}
    udict.update(udict1)

We can use a dict comprehension (together with a generator expression):

lines = (line.strip().split('\t') for line in infile1)
udict = {line[0]: line[1] for line in lines}

It is also worth noting that you are working with a tab-delimited file. In this case, the csv module can be much better than using split().
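As a rough illustration (using the sample key/value pair from the question), csv.reader returns each row as a list of fields, which is why dict() can consume the reader directly in the code further below:

import csv

# csv.reader accepts any iterable of lines, so a plain list stands in for the real file here.
sample_lines = ["WBGene00045518\t21ur-5153"]
reader = csv.reader(sample_lines, delimiter='\t')
print(list(reader))   # [['WBGene00045518', '21ur-5153']]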

Also note that using the with statement improves readability and ensures that your files are closed (even with exceptions).

Print statements also slow things down if they run on every loop iteration. They are useful for debugging, but when you run over your main chunk of data it is probably worth removing them.

Another "more pythonic" thing you can do is use enumerate() rather than incrementing a variable yourself each time. For example:

linecounter = 0
for line in infile2:
    ...
    linecounter += 1

Can be replaced by:

for linecounter, line in enumerate(infile2):
    ...

Where you use count() just to check whether a key occurs in the line, a better solution is to use in:

 if key in line: 

since this short-circuits as soon as one instance is found.

Putting all of this together, let's see what we have:

import csv
from datetime import datetime

startTime = datetime.now()

with open('f1.txt', 'r') as infile1:
    reader = csv.reader(infile1, delimiter='\t')
    udict = dict(reader)

with open('f2.txt', 'r') as infile2, open('out.txt', 'w') as outfile:
    for line in infile2:
        for key, value in udict.items():
            if key in line:
                line = line.replace(key, value)
        outfile.write(line + '\n')

Edit: list comprehension vs. a normal for loop, as noted in the comments:

python -m timeit "[i*10000 for i in range(10000)]"
1000 loops, best of 3: 909 usec per loop

python -m timeit "a = []" "for i in range(10000):" "  a.append(i)"
1000 loops, best of 3: 1.01 msec per loop

Note usec vs. msec. It's not a massive difference, but it's something.

-1
