Python Pandas: memory overflow during merge

I am new to Pandas and am trying to combine several subsets of data. I'll give the specific case where this happens, but the question is general: how and why does this happen, and how can I work around it?

The downloaded data is only about 85 MB or so, but I watch my Python session balloon to around 10 GB of memory and then raise a MemoryError.

I have no idea why this happens, and it's blocking me: I can't even start looking at the data the way I want.

Here is what I did:

Import Master Data

    import requests, zipfile, StringIO
    import numpy as np
    import pandas as pd

    STAR2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013_all_csv_v3.zip"
    STAR2013fileName = 'ca2013_all_csv_v3.txt'

    r = requests.get(STAR2013url)
    z = zipfile.ZipFile(StringIO.StringIO(r.content))
    STAR2013 = pd.read_csv(z.open(STAR2013fileName))
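(Aside: the StringIO module is Python 2 only. On Python 3 the same download-and-read pattern would use io.BytesIO for the zipped bytes; a sketch, same URLs and names as above:)

    # Python 3 equivalent of the download-and-read step (sketch)
    import io
    import zipfile
    import requests
    import pandas as pd

    r = requests.get(STAR2013url)
    z = zipfile.ZipFile(io.BytesIO(r.content))  # r.content is bytes on Python 3
    STAR2013 = pd.read_csv(z.open(STAR2013fileName))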

Import some cross-reference tables

    STARentityList2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013entities_csv.zip"
    STARentityList2013fileName = "ca2013entities_csv.txt"
    r = requests.get(STARentityList2013url)
    z = zipfile.ZipFile(StringIO.StringIO(r.content))
    STARentityList2013 = pd.read_csv(z.open(STARentityList2013fileName))

    STARlookUpTestID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/tests.zip"
    STARlookUpTestID2013fileName = "Tests.txt"
    r = requests.get(STARlookUpTestID2013url)
    z = zipfile.ZipFile(StringIO.StringIO(r.content))
    STARlookUpTestID2013 = pd.read_csv(z.open(STARlookUpTestID2013fileName))

    STARlookUpSubgroupID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/subgroups.zip"
    STARlookUpSubgroupID2013fileName = "Subgroups.txt"
    r = requests.get(STARlookUpSubgroupID2013url)
    z = zipfile.ZipFile(StringIO.StringIO(r.content))
    STARlookUpSubgroupID2013 = pd.read_csv(z.open(STARlookUpSubgroupID2013fileName))

Rename the merge key column

    STARlookUpSubgroupID2013 = STARlookUpSubgroupID2013.rename(columns={'001': 'Subgroup ID'})
    STARlookUpSubgroupID2013

Successful merge

    merged = pd.merge(STAR2013, STARlookUpSubgroupID2013, on='Subgroup ID')

Try the second merge. Memory overflow occurs here

    merged = pd.merge(merged, STARentityList2013, on='School Code')
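(A quick way to see why this merge explodes is to count how often each key appears on both sides: with an inner merge, the per-key counts multiply in the output. A diagnostic sketch, assuming 'School Code' is present in both frames as the merge above implies:)

    # Sketch: estimate the row count of the inner merge by multiplying the
    # per-key counts from both sides. Keys repeated on both sides multiply,
    # which is what blows up memory.
    left_counts = merged['School Code'].value_counts()
    right_counts = STARentityList2013['School Code'].value_counts()
    estimated_rows = (left_counts * right_counts).dropna().sum()
    print(estimated_rows)  # compare with len(merged) to see the blow-up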

I did all of this in an IPython notebook, but I don't think that changes anything.

1 answer

Although this is an old question, I recently ran into the same problem.

In my case, there are duplicate keys in both data frames, and I needed a method that could determine whether the merge would fit in memory before computing it and, if not, change the computation method.

The method I came up with is as follows:

Calculate merge size:

    def merge_size(left_frame, right_frame, group_by, how='inner'):
        # Rows per key on each side.
        left_groups = left_frame.groupby(group_by).size()
        right_groups = right_frame.groupby(group_by).size()
        left_keys = set(left_groups.index)
        right_keys = set(right_groups.index)
        intersection = right_keys & left_keys
        left_diff = left_keys - intersection
        right_diff = right_keys - intersection

        # NaN != NaN, so this counts the rows whose key is NaN.
        left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
        right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
        left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
        right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

        # Keys present on both sides contribute the product of their counts.
        sizes = [(left_groups[group_name] * right_groups[group_name])
                 for group_name in intersection]
        sizes += [left_nan * right_nan]

        # Keys present on only one side matter for left/right/outer merges.
        left_size = [left_groups[group_name] for group_name in left_diff]
        right_size = [right_groups[group_name] for group_name in right_diff]
        if how == 'inner':
            return sum(sizes)
        elif how == 'left':
            return sum(sizes + left_size)
        elif how == 'right':
            return sum(sizes + right_size)
        return sum(sizes + left_size + right_size)  # outer
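(For example, a small sketch with made-up frames: a key that appears twice on both sides contributes 2 * 2 = 4 rows to an inner merge:)

    # Sketch: merge_size on two tiny frames with duplicate keys.
    import pandas as pd

    df1 = pd.DataFrame({'key': ['a', 'a', 'b'], 'x': [1, 2, 3]})
    df2 = pd.DataFrame({'key': ['a', 'a', 'c'], 'y': [4, 5, 6]})

    merge_size(df1, df2, 'key', how='inner')  # 4  ('a' contributes 2 * 2)
    merge_size(df1, df2, 'key', how='outer')  # 6  (plus 'b' and 'c', one row each)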

Note:

Currently, with this method, the key can only be a single label, not a list. Passing a list as group_by currently returns the sum of the merge sizes for each label in the list, which will be much larger than the actual merge size.

If you are using a list of labels for group_by, an upper bound on the final row count is:

 min([merge_size(df1, df2, label, how) for label in group_by]) 

Check if it fits in memory

The merge_size function defined here returns the number of rows that will be created by merging the two data frames.

Multiplying this by the number of columns from both data frames, and then by the itemsize of np.float32/np.float64, gives an approximate idea of how big the resulting data frame will be in memory. This can then be compared against psutil.virtual_memory().available to find out whether your system can compute the full merge.

    import numpy as np
    import psutil

    def mem_fit(df1, df2, key, how='inner'):
        rows = merge_size(df1, df2, key, how)
        # The key column is shared, so count it only once.
        cols = len(df1.columns) + (len(df2.columns) - 1)
        # Rough estimate: every cell an 8-byte float64.
        required_memory = (rows * cols) * np.dtype(np.float64).itemsize
        return required_memory <= psutil.virtual_memory().available
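(A minimal usage sketch against the question's frames; the chunked, write-to-disk fallback here is just one possible "change the computation method" strategy, not something prescribed by the original answer:)

    # Sketch: guard the expensive merge; if it won't fit, merge the left
    # frame in slices and stream each partial result to disk instead of
    # holding the full result in memory.
    if mem_fit(merged, STARentityList2013, 'School Code'):
        result = pd.merge(merged, STARentityList2013, on='School Code')
    else:
        chunk = 100000  # rows per slice; tune to available memory
        for i, start in enumerate(range(0, len(merged), chunk)):
            piece = merged.iloc[start:start + chunk]
            part = pd.merge(piece, STARentityList2013, on='School Code')
            part.to_csv('merged_parts.csv', mode='a', header=(i == 0), index=False)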

The merge_size method was proposed as a pandas extension in this issue: https://github.com/pandas-dev/pandas/issues/15068
