Using pandas assign within a pipeline

Question

Say I have the following DataFrame of raw input and want to process it with a chain of pandas method calls (a "pipeline"). Specifically, I want to rename and drop columns and add a new column derived from an existing one.

        Gene stable ID  Gene name  Gene type  miRBase accession     miRBase ID
    0  ENSG00000274494    MIR6832      miRNA          MI0022677   hsa-mir-6832
    1  ENSG00000283386   MIR4659B      miRNA          MI0017291  hsa-mir-4659b
    2  ENSG00000221456    MIR1202      miRNA          MI0006334   hsa-mir-1202
    3  ENSG00000199102    MIR302C      miRNA          MI0000773   hsa-mir-302c

I am currently doing the following (which works):

    tmp_df = df.\
        drop("Gene type", axis=1).\
        rename(columns = {
            "Gene stable ID": "ENSG",
            "Gene name": "gene_name",
            "miRBase accession": "MI",
            "miRBase ID": "mirna_name"
        })

    result = tmp_df.assign(species = tmp_df.mirna_name.str[:3])

result:

                  ENSG  gene_name         MI     mirna_name  species
    0  ENSG00000274494    MIR6832  MI0022677   hsa-mir-6832      hsa
    1  ENSG00000283386   MIR4659B  MI0017291  hsa-mir-4659b      hsa
    2  ENSG00000221456    MIR1202  MI0006334   hsa-mir-1202      hsa
    3  ENSG00000199102    MIR302C  MI0000773   hsa-mir-302c      hsa

Is it possible to put the assign call directly into the "pipeline"? Having to create an extra temporary variable is cumbersome, but I don't know how to refer to the renamed column ("mirna_name") from inside the chain.

+7

python pandas

Gregor sturm Jun 19 '17 at 12:59

3 answers

    result = df.drop("Gene type", axis=1).\
        rename(columns = {
            "Gene stable ID": "ENSG",
            "Gene name": "gene_name",
            "miRBase accession": "MI",
            "miRBase ID": "mirna_name"
        }).assign(species = df['miRBase ID'].str[:3])

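You can refer to the column through the original df under its pre-rename name, df['miRBase ID']. The values are matched to the chained result by index, which drop and rename leave unchanged.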
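One caveat worth sketching: a Series taken from the original df is aligned to the chained result by index label, so this approach relies on no step in the chain changing the index. The toy example below uses made-up rows, and the filter and reset_index steps are added purely for illustration. It contrasts the Series approach with passing a callable to assign, which is evaluated on the chain's intermediate result.

```python
import pandas as pd

# Toy data (not from the question): the middle row belongs to another species.
df = pd.DataFrame({
    "miRBase ID": ["hsa-mir-6832", "mmu-mir-4659b", "hsa-mir-1202"],
})

# A chain whose index no longer matches the original df:
# filtering keeps labels 0 and 2, reset_index relabels them 0 and 1.
chained = (
    df[df["miRBase ID"].str.startswith("hsa")]
    .reset_index(drop=True)
    .rename(columns={"miRBase ID": "mirna_name"})
)

# Series from the original df: aligned by label, so the row now labelled 1
# receives the value of the original row 1 ("mmu").
from_original = chained.assign(species=df["miRBase ID"].str[:3])

# Callable: computed from the intermediate frame, so rows stay consistent.
from_callable = chained.assign(species=lambda x: x["mirna_name"].str[:3])

print(from_original)
print(from_callable)
```

For the chain in this answer the two forms give the same result, since drop and rename keep the index intact.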

+1

sowmya Jun 19 '17 at 13:27

I found pandas-ply, which introduces a magic X symbol for this purpose:

    import pandas as pd
    from pandas_ply import X, install_ply

    install_ply(pd)

    df\
        .drop("Gene type", axis=1)\
        .rename(columns = {
            "Gene stable ID": "ENSG",
            "Gene name": "gene_name",
            "miRBase accession": "MI",
            "miRBase ID": "mirna_name"
        })\
        .ply_select("*", species = X.mirna_name.str[:3])

It would be nice to have this in native pandas.

0

Gregor sturm Jul 14 '17 at 8:01

Allen · Accepted Answer · 2017-06-19T13:27:31+0000

You can use pipe:

    tmp_df = df.\
        drop("Gene type", axis=1).\
        rename(columns = {
            "Gene stable ID": "ENSG",
            "Gene name": "gene_name",
            "miRBase accession": "MI",
            "miRBase ID": "mirna_name"
        }).\
        pipe(lambda x: x.assign(species = x.mirna_name.str[:3]))

    tmp_df
    Out[365]:
                  ENSG  gene_name         MI     mirna_name  species
    0  ENSG00000274494    MIR6832  MI0022677   hsa-mir-6832      hsa
    1  ENSG00000283386   MIR4659B  MI0017291  hsa-mir-4659b      hsa
    2  ENSG00000221456    MIR1202  MI0006334   hsa-mir-1202      hsa
    3  ENSG00000199102    MIR302C  MI0000773   hsa-mir-302c      hsa
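When the step grows beyond a one-liner, pipe also accepts a named function plus extra arguments, which can read more clearly than a lambda. A minimal sketch, assuming a hypothetical helper add_species (not part of pandas or of the answer above) and a cut-down two-row frame:

```python
import pandas as pd


def add_species(frame, source="mirna_name", length=3):
    """Hypothetical helper: add a 'species' column from a prefix of `source`."""
    return frame.assign(species=frame[source].str[:length])


# Reduced version of the question's data, for illustration only.
df = pd.DataFrame({
    "Gene type": ["miRNA", "miRNA"],
    "miRBase ID": ["hsa-mir-6832", "hsa-mir-1202"],
})

piped = (
    df
    .drop("Gene type", axis=1)
    .rename(columns={"miRBase ID": "mirna_name"})
    # pipe passes the intermediate DataFrame as the first argument,
    # followed by any extra positional or keyword arguments.
    .pipe(add_species, length=3)
)

print(piped)
```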

As @Tom noted, in this case it can also be done without pipe, because assign accepts a callable that receives the DataFrame it is called on:

    df.\
        drop("Gene type", axis=1).\
        rename(columns = {
            "Gene stable ID": "ENSG",
            "Gene name": "gene_name",
            "miRBase accession": "MI",
            "miRBase ID": "mirna_name"
        }).\
        assign(species = lambda x: x.mirna_name.str[:3])
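For anyone who wants to try this end to end, here is a self-contained sketch that rebuilds the question's sample data by hand and runs the same chain. The parenthesised layout and drop(columns=...) are stylistic choices made here, not something the answer prescribes.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Gene stable ID": ["ENSG00000274494", "ENSG00000283386",
                       "ENSG00000221456", "ENSG00000199102"],
    "Gene name": ["MIR6832", "MIR4659B", "MIR1202", "MIR302C"],
    "Gene type": ["miRNA"] * 4,
    "miRBase accession": ["MI0022677", "MI0017291", "MI0006334", "MI0000773"],
    "miRBase ID": ["hsa-mir-6832", "hsa-mir-4659b",
                   "hsa-mir-1202", "hsa-mir-302c"],
})

result = (
    df
    .drop(columns="Gene type")
    .rename(columns={
        "Gene stable ID": "ENSG",
        "Gene name": "gene_name",
        "miRBase accession": "MI",
        "miRBase ID": "mirna_name",
    })
    # The callable receives the renamed intermediate frame,
    # so "mirna_name" is available without a temporary variable.
    .assign(species=lambda x: x["mirna_name"].str[:3])
)

print(result)
```

Wrapping the chain in parentheses avoids the trailing backslashes, and the original df is left unmodified because each step returns a new DataFrame.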
