goal
My goal is to combine the two DataFrames by their common column (gene names) so that I can take the product of each gene in each row of genes. Then I performed groupby for patients and cells and summarized all the scores from each. The final data frame should look like this:
patient cell Pat_1 22RV1 12 DU145 15 LN18 9 Pat_2 22RV1 12 DU145 15 LN18 9 Pat_3 22RV1 12 DU145 15 LN18 9
This last part should work fine, but I was not able to perform the first gene name merge due to a MemoryError . Below are snippets of each DataFrame.
Data
cell_s =
Description Name level_2 0 0 LOC100009676 100009676_at LN18_CENTRAL_NERVOUS_SYSTEM 1 1 LOC100009676 100009676_at 22RV1_PROSTATE 2 2 LOC100009676 100009676_at DU145_PROSTATE 3 3 AKT3 10000_at LN18_CENTRAL_NERVOUS_SYSTEM 4 4 AKT3 10000_at 22RV1_PROSTATE 5 5 AKT3 10000_at DU145_PROSTATE 6 6 MED6 10001_at LN18_CENTRAL_NERVOUS_SYSTEM 7 7 MED6 10001_at 22RV1_PROSTATE 8 8 MED6 10001_at DU145_PROSTATE 9
cell_s is about 10,000,000 rows
patient_s =
id level_1 0 0 MED6 Pat_1 1 1 MED6 Pat_2 1 2 MED6 Pat_3 1 3 LOC100009676 Pat_1 2 4 LOC100009676 Pat_2 2 5 LOC100009676 Pat_3 2 6 ABCD Pat_1 3 7 ABCD Pat_2 3 8 ABCD Pat_3 3 ....
patient_s is about 1,200,000 rows
code
def get_score(cell, patient): cell_s = cell.set_index(['Description', 'Name']).stack().reset_index() cell_s.columns = ['Description', 'Name', 'cell', 's1'] patient_s = patient.set_index('id').stack().reset_index() patient_s.columns = ['id', 'patient', 's2']
I was getting a MemoryError when initially read_csv these files, but then specifying dtypes resolved the issue. Confirming that my python is 64 bit also did not fix my problem. I have not reached the limits on pandas, right?
Python 3.4.3 | Anaconda 2.3.0 (64-bit) | Pandas 0.16.2