The best way to combine two large datasets in Pandas

Question

I am loading two datasets from two different databases, and they need to be joined. Each of them is about 500 MB when stored as CSV. Each one fits into memory on its own, but when I load both I sometimes get a memory error. I definitely run into problems when I try to merge them with pandas.

What is the best way to do an outer join on them so that I don't get a memory error? I do not have any database servers, but I can install any open source software on my computer if that helps. Ideally, I would still like to solve it in pandas only, but I'm not sure if this is possible at all.

To clarify: by merging, I mean an outer join. Each table has two columns: product and version. I want to check which products and versions are only in the left table, only in the right table, and in both tables. I do that with:

pd.merge(df1, df2, left_on=['product', 'version'], right_on=['product', 'version'], how='outer')

+4

python memory-management pandas

Nickpick Jun 10 '16 at 20:51

2 answers

root · Answer 1 · 2016-06-10T21:37:16+0000

This looks like the kind of task dask was designed for. Essentially, dask can perform pandas operations out-of-core, so you can work with datasets that do not fit into memory. The dask.dataframe API is a subset of the pandas API, so the learning curve should not be steep. See the Dask DataFrame Overview page for additional DataFrame-specific details.

import dask.dataframe as dd

# Read in the csv files.
df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# Merge the csv files.
df = dd.merge(df1, df2, how='outer', on=['product','version'])

# Write the output.
df.to_csv('file3.csv', index=False)

If 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:

df = dd.concat([df1, df2]).drop_duplicates()

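I am not entirely sure this will be better, but merges that are not done on the index are apparently "slow-ish" in dask, so it could be worth a try.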
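For the three-way check from the question (left only, right only, both), pandas' own `merge` accepts `indicator=True`, which adds a `_merge` column holding `left_only`, `right_only` or `both` for every row of the outer join. The sketch below uses plain pandas on tiny made-up tables, so it only illustrates the classification step and assumes the merged result fits in memory; the sample product names and versions are illustrative, not taken from the question.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are illustrative only).
df1 = pd.DataFrame({"product": ["a", "a", "b"], "version": [1, 2, 1]})
df2 = pd.DataFrame({"product": ["a", "b", "c"], "version": [2, 1, 1]})

# Outer join on both key columns; indicator=True adds a `_merge` column
# whose value says which side(s) each (product, version) pair came from.
merged = pd.merge(df1, df2, how="outer", on=["product", "version"], indicator=True)

left_only = merged[merged["_merge"] == "left_only"]
right_only = merged[merged["_merge"] == "right_only"]
both = merged[merged["_merge"] == "both"]

print(merged)
```

Filtering on `_merge` avoids running three separate joins, which matters when each join is expensive.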

MaxU · Answer 2 · 2016-06-15T21:24:19+0000

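I would recommend using an RDBMS such as MySQL for this.

To do so, you first need to load your CSV files into tables.

After that, you can run your checks (only in the left table, only in the right table, in both tables) as follows: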

SELECT a.product, a.version
FROM table_a a
LEFT JOIN table_b b
ON a.product = b.product AND a.version = b.version
WHERE b.product IS NULL;

SELECT b.product, b.version
FROM table_a a
RIGHT JOIN table_b b
ON a.product = b.product AND a.version = b.version
WHERE a.product IS NULL;

SELECT a.product, a.version
FROM table_a a
JOIN table_b b
ON a.product = b.product AND a.version = b.version;

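Configure your MySQL server so that it uses at least 2 GB of RAM.

You may also want to use the MyISAM engine for your tables.

This may be slower than pandas, but you should not run into memory problems.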
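The three queries above can also be tried without installing a database server. The sketch below swaps in SQLite (through Python's built-in `sqlite3` module) for MySQL; that substitution is an assumption of this example, not part of the answer. The `load_csv` and `compare` helpers, the file names and the sample rows are hypothetical. Since older SQLite versions do not support RIGHT JOIN, the "right table only" query is written as a LEFT JOIN with the tables swapped.

```python
import csv
import os
import sqlite3
import tempfile

ONLY_IN_FIRST = """
SELECT x.product, x.version
FROM {first} x
LEFT JOIN {second} y
  ON x.product = y.product AND x.version = y.version
WHERE y.product IS NULL
"""

IN_BOTH = """
SELECT a.product, a.version
FROM table_a a
JOIN table_b b
  ON a.product = b.product AND a.version = b.version
"""

def load_csv(conn, table, path):
    """Create `table` and stream rows into it from a CSV file with a header row."""
    conn.execute(f"CREATE TABLE {table} (product TEXT, version TEXT)")
    with open(path, newline="") as f:
        rows = ((r["product"], r["version"]) for r in csv.DictReader(f))
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    # An index on the join keys keeps the joins from scanning every row pair.
    conn.execute(f"CREATE INDEX idx_{table} ON {table} (product, version)")
    conn.commit()

def compare(conn):
    """Return (only in table_a, only in table_b, in both) as lists of tuples."""
    only_a = conn.execute(ONLY_IN_FIRST.format(first="table_a", second="table_b")).fetchall()
    only_b = conn.execute(ONLY_IN_FIRST.format(first="table_b", second="table_a")).fetchall()
    both = conn.execute(IN_BOTH).fetchall()
    return only_a, only_b, both

# Demo on two tiny made-up CSV files written to a temporary directory.
tmp = tempfile.mkdtemp()
samples = {
    "file1.csv": [("a", "1"), ("a", "2"), ("b", "1")],
    "file2.csv": [("a", "2"), ("b", "1"), ("c", "1")],
}
for name, rows in samples.items():
    with open(os.path.join(tmp, name), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["product", "version"])
        writer.writerows(rows)

# An on-disk database file keeps the tables out of Python's memory.
conn = sqlite3.connect(os.path.join(tmp, "compare.db"))
load_csv(conn, "table_a", os.path.join(tmp, "file1.csv"))
load_csv(conn, "table_b", os.path.join(tmp, "file2.csv"))
only_a, only_b, both = compare(conn)
conn.close()

print(only_a, only_b, sorted(both))
```

Rows are streamed from the CSV reader into the database, so the full files are never held in memory at once.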

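Other possible solutions:

Increase your RAM.
Use Apache Spark SQL (distributed DataFrame) on multiple cluster nodes, although it would be much cheaper to increase your RAM.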
