Using excessive Python memory with a simple script

I am writing a very simple script that counts occurrences of value pairs in a file. The file is about 300 MB (15 million lines) and has 3 columns. Since I am reading the file line by line, I do not expect Python to use much memory. At most it should use a little over 300 MB to store the counts dictionary.

However, when I look at Activity Monitor, memory usage exceeds 1.5 GB. What am I doing wrong? If this is normal, can someone please explain? Thanks.

import csv
def get_counts(filepath):
    with open(filepath,'rb') as csvfile:
        reader = csv.DictReader(csvfile, fieldnames=['col1','col2','col3'], delimiter=',')
        counts = {}
        for row in reader:

            key1 = int(row['col1'])
            key2 = int(row['col2'])

            if (key1, key2) in counts:
                counts[key1, key2] += 1
            else:
                counts[key1, key2] = 1

    return counts
3 answers

I think it's ok that Python uses so much memory in your case. Here is a test on my machine:

Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
>>> file_size = 300000000
>>> column_count = 10
>>> average_string_size = 10
>>> row_count = file_size / (column_count * average_string_size)
>>> row_count
3000000
>>> import os, psutil, cPickle
>>> mem1 = psutil.Process(os.getpid()).memory_info().rss
>>> data = [{column_no: '*' * average_string_size for column_no in xrange(column_count)} for row_no in xrange(row_count)]
>>> mem2 = psutil.Process(os.getpid()).memory_info().rss
>>> mem2 - mem1
4604071936L
>>>

So 3,000,000 dicts, each holding 10 strings of 10 characters, take about 4 GB.
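To get a feel for the per-object overhead behind those numbers, `sys.getsizeof` reports the base size of each container. This is a rough sketch; the exact figures vary by CPython version and platform:

```python
import sys

# Base sizes of the kinds of objects the script creates per row.
# Exact numbers depend on the CPython build, but each carries
# tens of bytes of overhead beyond the raw data.
print(sys.getsizeof({}))        # an empty dict
print(sys.getsizeof((1, 2)))    # a 2-tuple, like the counts keys
print(sys.getsizeof(10 ** 6))   # a parsed int
```

With around 15 million tuple keys, each holding two ints, tens of bytes of overhead per object quickly add up to gigabytes.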

In your case the csv reader itself only holds one row at a time; the memory goes into the counts dictionary.

If you don't need dicts, use csv.reader instead of DictReader: it yields plain lists, which are much cheaper (no per-row dict).

To see where the memory actually goes, profile the script with https://pypi.python.org/pypi/memory_profiler
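If installing memory_profiler isn't an option, the standard-library `tracemalloc` module (Python 3.4+) gives a similar picture. A minimal sketch, using a dict keyed by int pairs like the one in the question:

```python
import tracemalloc

tracemalloc.start()

# Build a dict keyed by int pairs, like the counts dict in the question.
counts = {}
for i in range(100000):
    counts[(i, i + 1)] = 1

current, peak = tracemalloc.get_traced_memory()
print("current: %.1f MB, peak: %.1f MB" % (current / 1e6, peak / 1e6))
tracemalloc.stop()
```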

P.S. Instead of

        if (key1, key2) in counts:
            counts[key1, key2] += 1
        else:
            counts[key1, key2] = 1

do

from collections import defaultdict
...
counts = defaultdict(int)
...
counts[(key1, key2)] += 1
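For example, with a few sample keys (made-up data, just to show the behaviour):

```python
from collections import defaultdict

counts = defaultdict(int)
for key in [(1, 2), (1, 2), (3, 4)]:
    counts[key] += 1   # a missing key starts at 0; no membership test needed

print(dict(counts))    # {(1, 2): 2, (3, 4): 1}
```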

Something like this:

import csv

def get_counts(filepath):
    counts = {}
    with open(filepath) as csvfile:
        data = csv.reader(csvfile, delimiter=',')
        # Skip the first line if it contains headers
        next(data, None)
        for row in data:
            key = (row[0], row[1])
            counts[key] = counts.get(key, 0) + 1
    return counts

Try this:

from collections import Counter
import csv

with open(filename) as f:
    myreader = csv.reader(f)
    # Rows are lists and not hashable, so convert each key to a tuple;
    # row[:-1] drops the last column and keeps the first two as the key.
    counts = Counter(tuple(row[:-1]) for row in myreader)
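With a few made-up rows, that pattern looks like this:

```python
from collections import Counter

# Hypothetical rows: two key columns plus one extra column to drop.
rows = [("1", "2", "x"), ("1", "2", "y"), ("3", "4", "z")]
counts = Counter(tuple(row[:-1]) for row in rows)

print(counts.most_common(1))   # [(('1', '2'), 2)]
```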

Hope this helps.
