Pandas DataFrame.merge MemoryError

Goal

My goal is to merge the two DataFrames on their common column (gene names) so that I can take the product of the two scores for each gene. I then group by patient and cell and sum the resulting scores. The final DataFrame should look like this:

  patient  cell
  Pat_1    22RV1    12
           DU145    15
           LN18      9
  Pat_2    22RV1    12
           DU145    15
           LN18      9
  Pat_3    22RV1    12
           DU145    15
           LN18      9

That last part should work fine, but I was not able to perform the initial merge on gene names due to a MemoryError. Below are snippets of each DataFrame.

Data

cell_s =

       Description          Name                      level_2  0
  0   LOC100009676  100009676_at  LN18_CENTRAL_NERVOUS_SYSTEM  1
  1   LOC100009676  100009676_at               22RV1_PROSTATE  2
  2   LOC100009676  100009676_at               DU145_PROSTATE  3
  3           AKT3      10000_at  LN18_CENTRAL_NERVOUS_SYSTEM  4
  4           AKT3      10000_at               22RV1_PROSTATE  5
  5           AKT3      10000_at               DU145_PROSTATE  6
  6           MED6      10001_at  LN18_CENTRAL_NERVOUS_SYSTEM  7
  7           MED6      10001_at               22RV1_PROSTATE  8
  8           MED6      10001_at               DU145_PROSTATE  9

cell_s is about 10,000,000 rows

patient_s =

                id level_1  0
  0           MED6   Pat_1  1
  1           MED6   Pat_2  1
  2           MED6   Pat_3  1
  3   LOC100009676   Pat_1  2
  4   LOC100009676   Pat_2  2
  5   LOC100009676   Pat_3  2
  6           ABCD   Pat_1  3
  7           ABCD   Pat_2  3
  8           ABCD   Pat_3  3
  ....

patient_s is about 1,200,000 rows

Code

    def get_score(cell, patient):
        # Reshape the cell DataFrame from wide to long: one row per (gene, cell line).
        cell_s = cell.set_index(['Description', 'Name']).stack().reset_index()
        cell_s.columns = ['Description', 'Name', 'cell', 's1']

        # Reshape the patient DataFrame the same way: one row per (gene, patient).
        patient_s = patient.set_index('id').stack().reset_index()
        patient_s.columns = ['id', 'patient', 's2']

        # fails here with a MemoryError:
        merged = cell_s.merge(patient_s, left_on='Description', right_on='id')

        merged['score'] = merged.s1 * merged.s2
        scores = merged.groupby(['patient', 'cell'])['score'].sum()
        return scores

I was getting a MemoryError when I initially read_csv these files, but specifying the dtypes resolved that. I also confirmed that my Python is 64-bit, which did not fix the current problem. I have not reached the limits of pandas, have I?
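For context, here is a minimal sketch of the kind of dtype trimming described above. The file name, the float32 choice, and the assumption that all non-key columns are numeric scores are illustrative only, not from the original post:

    import numpy as np
    import pandas as pd

    # Hypothetical illustration of specifying/downcasting dtypes at load time;
    # 'cells.csv' is a placeholder for the real file.
    cell = pd.read_csv('cells.csv', dtype={'Description': object, 'Name': object})

    # Downcasting the wide numeric score columns from float64 to float32 roughly
    # halves their memory footprint before stack()/merge().
    score_cols = cell.columns.difference(['Description', 'Name'])
    cell[score_cols] = cell[score_cols].astype(np.float32)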

Python 3.4.3 | Anaconda 2.3.0 (64-bit) | Pandas 0.16.2

+6
2 answers

Consider two workarounds:

CSV BY CHUNKS

Apparently read_csv can suffer from performance and memory issues on very large files, so load them in iterated chunks.

    import pandas as pd

    cellsfilepath = 'C:\\Path\\To\\Cells\\CSVFile.csv'
    tp = pd.io.parsers.read_csv(cellsfilepath, sep=',', iterator=True, chunksize=1000)
    cell_s = pd.concat(tp, ignore_index=True)

    patientsfilepath = 'C:\\Path\\To\\Patients\\CSVFile.csv'
    tp = pd.io.parsers.read_csv(patientsfilepath, sep=',', iterator=True, chunksize=1000)
    patient_s = pd.concat(tp, ignore_index=True)
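If concatenating every chunk back into one full DataFrame still exhausts memory, a further option (a sketch, not part of the original answer) is to reshape, merge, and aggregate one chunk at a time and keep only the small per-chunk sums. This assumes the reshaped patient_s from get_score (columns id, patient, s2) fits in memory:

    import pandas as pd

    # Sketch: process the cell data chunk by chunk so the full merged result
    # never has to exist in memory. patient_s here is the reshaped frame with
    # columns ['id', 'patient', 's2'] built in get_score above.
    partials = []
    reader = pd.read_csv(cellsfilepath, sep=',', iterator=True, chunksize=100000)
    for chunk in reader:
        chunk_s = chunk.set_index(['Description', 'Name']).stack().reset_index()
        chunk_s.columns = ['Description', 'Name', 'cell', 's1']
        merged = chunk_s.merge(patient_s, left_on='Description', right_on='id')
        merged['score'] = merged.s1 * merged.s2
        partials.append(merged.groupby(['patient', 'cell'])['score'].sum())

    # The per-chunk sums are additive, so summing them again gives the final scores.
    scores = pd.concat(partials).groupby(level=['patient', 'cell']).sum()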

CSV VIA SQL

As a database guy, I always recommend handling large data loads and merges/joins with a SQL relational engine, which scales well for such processes. I have written many comments on DataFrame merge Q&As to this effect, even for R. You can use any SQL database, including file-based ones (Access, SQLite) or client-server ones (MySQL, MSSQL, or others), even the one your dfs originally came from. Python ships with a built-in library for SQLite (otherwise you would use ODBC), and DataFrames can be pushed into a database as tables with pandas' to_sql:

    import sqlite3
    import pandas as pd

    dbfile = 'C:\\Path\\To\\SQlitedb.sqlite'
    cxn = sqlite3.connect(dbfile)
    c = cxn.cursor()

    cell_s.to_sql(name='cell_s', con=cxn, if_exists='replace')
    patient_s.to_sql(name='patient_s', con=cxn, if_exists='replace')

    strSQL = 'SELECT * FROM cell_s c INNER JOIN patient_s p ON c.Description = p.id;'
    # MIGHT HAVE TO ADJUST ABOVE FOR CELL AND PATIENT PARAMS IN DEFINED FUNCTION
    merged = pd.read_sql(strSQL, cxn)
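As a follow-up sketch (not part of the original answer): if the tables written above hold the reshaped frames from get_score (columns s1/s2, cell, patient), the multiply-and-sum can also be pushed into SQLite, so only the small aggregated result comes back into pandas:

    # Sketch: let SQLite do the join, product, and group-by. Assumes the tables
    # contain the reshaped columns ['Description', 'Name', 'cell', 's1'] and
    # ['id', 'patient', 's2'], and reuses the cxn connection opened above.
    strSQL = """
        SELECT p.patient, c.cell, SUM(c.s1 * p.s2) AS score
        FROM cell_s c
        INNER JOIN patient_s p ON c.Description = p.id
        GROUP BY p.patient, c.cell;
    """
    scores = pd.read_sql(strSQL, cxn)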
+5

You may have to do it in pieces, or look into Blaze: http://blaze.pydata.org

+1
