Although this is an old question, I recently ran into the same problem.
In my case, duplicate keys are required in both data frames, and I needed a way to determine, before actually running the merge, whether the result would fit in memory, and to switch to a different calculation method if it would not.
The method I came up with is as follows:
Calculate merge size:
```python
def merge_size(left_frame, right_frame, group_by, how='inner'):
    # Count rows per key on each side; for every key the two frames share,
    # the merge produces the product of the two counts.
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection

    # NaN keys match each other in a pandas merge, so count them as well
    # (x != x is the usual trick for locating NaN values).
    left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
    right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
    # If only one side has NaN keys, treat the empty side as a single row
    # so one-sided joins still count those rows.
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_nan * right_nan]

    # Keys present on only one side contribute rows only to the
    # corresponding one-sided (or outer) joins.
    left_size = [left_groups[group_name] for group_name in left_diff]
    right_size = [right_groups[group_name] for group_name in right_diff]
    if how == 'inner':
        return sum(sizes)
    elif how == 'left':
        return sum(sizes + left_size)
    elif how == 'right':
        return sum(sizes + right_size)
    return sum(sizes + left_size + right_size)
```
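For example, with two small frames sharing duplicate keys (the data here is made up purely for illustration), the estimate can be checked against an actual merge:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'a', 'b'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'a', 'c'], 'y': [4, 5, 6]})

# 'a' appears twice on each side -> 2 * 2 = 4 rows for an inner merge.
print(merge_size(left, right, 'key', how='inner'))        # 4
print(len(pd.merge(left, right, on='key', how='inner')))  # 4
```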
Note:
Currently, with this method, a key can only be a single label, not a list. Passing a list as group_by returns the sum of the merge sizes for each label in the list, which is much larger than the actual merge size.
If you are using a list of labels for group_by, the final row count is at most:

```python
min([merge_size(df1, df2, label, how) for label in group_by])
```
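As a minimal sketch of handling both cases (the wrapper name merge_size_upper_bound is my own, not part of the original method):

```python
def merge_size_upper_bound(df1, df2, group_by, how='inner'):
    """Upper bound on the merge row count when group_by may be a list."""
    if isinstance(group_by, str):
        return merge_size(df1, df2, group_by, how)
    return min(merge_size(df1, df2, label, how) for label in group_by)
```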
Check if it fits in memory
The merge_size function defined above returns the number of rows that merging the two data frames will create.
Multiplying this by the number of columns from both data frames, and then by the size of np.float32 or np.float64, gives a rough estimate of how large the resulting data frame will be in memory. That number can then be compared against psutil.virtual_memory().available to check whether your system can compute the full merge.
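For instance, a merge expected to produce 10,000,000 rows across 20 float64 columns would need roughly 10,000,000 × 20 × 8 bytes ≈ 1.6 GB (the numbers are illustrative only).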
```python
import numpy as np
import psutil

def mem_fit(df1, df2, key, how='inner'):
    rows = merge_size(df1, df2, key, how)
    # Column count of the result: all columns from both frames,
    # minus the shared key column counted once instead of twice.
    cols = len(df1.columns) + (len(df2.columns) - 1)
    # Assume every cell is a float64 (8 bytes) for a rough estimate.
    required_memory = (rows * cols) * np.dtype(np.float64).itemsize
    return required_memory <= psutil.virtual_memory().available
```
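Putting it together, one can gate the merge on mem_fit and fall back to an out-of-core strategy otherwise. The chunked fallback below is only a sketch of one possible alternative (correct for 'inner' and 'left' joins; 'right' and 'outer' joins would need more care), not part of the original method:

```python
import pandas as pd

def safe_merge(df1, df2, key, how='inner', chunk_size=100_000):
    if mem_fit(df1, df2, key, how):
        return pd.merge(df1, df2, on=key, how=how)
    # Fallback: merge the left frame in chunks and concatenate the pieces
    # (assumes each individual chunk merge does fit in memory).
    parts = [pd.merge(df1.iloc[i:i + chunk_size], df2, on=key, how=how)
             for i in range(0, len(df1), chunk_size)]
    return pd.concat(parts, ignore_index=True)
```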
The merge_size method was proposed as an extension to pandas in this issue: https://github.com/pandas-dev/pandas/issues/15068.