Summing word frequencies in a file with Python

I have a large file (950 MB) containing words and their frequencies, one pair per line:

word1 54
word2 1
word3 12
word4 3
word1 99
word4 147
word1 4
word2 6
etc...

I need to sum the frequencies for each word, for example word1 = 54 + 99 + 4 = 157, and write the totals to a list or a file. What is the most efficient way to do this in Python?

What I tried was building a list with one tuple per line and summing from there, but it crashed my laptop...

3 answers

Try the following:

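    from collections import defaultdict

    d = defaultdict(int)  # words not seen yet start at 0
    with open('file') as fh:
        for line in fh:  # reads one line at a time, not the whole file
            word, count = line.split()
            d[word] += int(count)  # count is read as a string, so convert it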

You do not have to read the entire file into memory. You can also split the file into several smaller files, process each one separately, and combine the resulting frequencies.
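As a rough sketch of that split-and-combine idea, the snippet below counts each chunk on its own and then adds the partial totals together. The helper names `count_lines` and `merge_counts` and the two in-memory chunks are made up for illustration; in practice each chunk would be an open file object for one of the smaller files.

```python
from collections import Counter

def count_lines(lines):
    """Sum the frequencies in an iterable of 'word count' lines."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, count = parts
        totals[word] += int(count)
    return totals

def merge_counts(partials):
    """Combine several partial results into one Counter."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return merged

# Hypothetical chunks standing in for two smaller files.
chunk_a = ["word1 54", "word2 1", "word3 12", "word4 3"]
chunk_b = ["word1 99", "word4 147", "word1 4", "word2 6"]

merged = merge_counts([count_lines(chunk_a), count_lines(chunk_b)])
print(merged["word1"])
```

Because a file object yields lines when iterated, `count_lines(open_file)` should work unchanged on a real chunk.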


950 MB should not be too large for most modern machines to hold in memory. I have done this many times in Python programs on a machine with 4 GB of physical memory, and I can imagine doing the same with less.

Still, you don't want to waste memory if you can avoid it. As a previous answer mentioned, processing the file line by line and accumulating the result is the right way to do this.

If you do not read the entire file into memory at once, you only need to worry about how much memory your accumulated result occupies, not the file itself. You can process files much larger than the one you mentioned, as long as the result you keep in memory does not grow too large. If it does, you will want to start saving partial results to files, but this problem does not seem to require that.
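If the accumulated result ever did outgrow memory, one possible approach (not something this problem appears to need) is to write each partial result to disk as a sorted file and then stream-merge those files. The sketch below assumes that scheme; `spill`, `read_pairs`, and `merge_spills` are hypothetical helper names, and the two small dictionaries stand in for partial results.

```python
import heapq
import os
import tempfile
from itertools import groupby

def spill(counts, directory):
    """Write one partial result as a sorted 'word count' file; return its path."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".txt", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for word in sorted(counts):
            out.write(f"{word} {counts[word]}\n")
    return path

def read_pairs(path):
    """Lazily yield (word, count) pairs from one partial file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, count = line.split()
            yield word, int(count)

def merge_spills(paths):
    """Yield (word, total) pairs while holding only one line per file in memory."""
    streams = [read_pairs(p) for p in paths]
    # Each file is sorted by word, so heapq.merge yields a globally sorted stream.
    merged = heapq.merge(*streams, key=lambda pair: pair[0])
    for word, group in groupby(merged, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

with tempfile.TemporaryDirectory() as tmp:
    paths = [
        spill({"word1": 54, "word2": 1}, tmp),
        spill({"word1": 103, "word2": 6, "word4": 150}, tmp),
    ]
    totals = list(merge_spills(paths))

print(totals)
```

The merge step relies on every partial file being written in sorted order, which is why `spill` sorts the words before writing.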

Here is about the simplest possible solution to your problem:

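    result = {}
    with open('myfile.txt') as f:  # the file is closed automatically
        for line in f:
            word, count = line.split()
            result[word] = int(count) + result.get(word, 0)

    # items() yields (word, total) tuples, so format each pair before joining
    print('\n'.join('%s %d' % item for item in result.items()))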

If you are running Linux or another UNIX-like OS, use top to monitor memory usage while the program is running.
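Memory can also be checked from inside the program. A minimal sketch using the standard-library `tracemalloc` module follows; it reports only memory allocated by Python after tracing starts, so the numbers will differ from what top shows for the whole process. The `sample_lines` list is a made-up stand-in for the real file.

```python
import tracemalloc
from collections import defaultdict

# Hypothetical stand-in for the lines of the real file.
sample_lines = ["word1 54", "word2 1", "word1 99"]

tracemalloc.start()  # begin tracking Python memory allocations

totals = defaultdict(int)
for line in sample_lines:
    word, count = line.split()
    totals[word] += int(count)

# Returns (current, peak) traced memory in bytes since start().
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current_bytes} bytes, peak: {peak_bytes} bytes")
```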
