Say I have the following DataFrame holding the original input, and I want to process it with a chain of pandas method calls (a "pipeline"). Specifically, I want to rename and drop columns, and add a new column derived from an existing one.
    Gene stable ID  Gene name  Gene type  miRBase accession     miRBase ID
0  ENSG00000274494    MIR6832      miRNA          MI0022677   hsa-mir-6832
1  ENSG00000283386   MIR4659B      miRNA          MI0017291  hsa-mir-4659b
2  ENSG00000221456    MIR1202      miRNA          MI0006334   hsa-mir-1202
3  ENSG00000199102    MIR302C      miRNA          MI0000773   hsa-mir-302c
I am currently doing the following (which works):
tmp_df = df.\
    drop("Gene type", axis=1).\
    rename(columns = {
        "Gene stable ID": "ENSG",
        "Gene name": "gene_name",
        "miRBase accession": "MI",
        "miRBase ID": "mirna_name"
    })

result = tmp_df.assign(species = tmp_df.mirna_name.str[:3])
result:
              ENSG  gene_name         MI     mirna_name  species
0  ENSG00000274494    MIR6832  MI0022677   hsa-mir-6832      hsa
1  ENSG00000283386   MIR4659B  MI0017291  hsa-mir-4659b      hsa
2  ENSG00000221456    MIR1202  MI0006334   hsa-mir-1202      hsa
3  ENSG00000199102    MIR302C  MI0000773   hsa-mir-302c      hsa
Is it possible to put the assign call directly into the "pipeline"? Having to create an additional temporary variable is cumbersome, and I don't know how to refer to the renamed column ("mirna_name") from within the chain.
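For context, one approach that may fit here: `DataFrame.assign` accepts a callable, which pandas calls with the DataFrame as it exists at that point in the chain. The sketch below is only an illustration under that assumption; it rebuilds the sample data from this question, and the lambda parameter name `d` is arbitrary.

```python
import pandas as pd

# Sample data rebuilt from the table in the question
df = pd.DataFrame({
    "Gene stable ID": ["ENSG00000274494", "ENSG00000283386",
                       "ENSG00000221456", "ENSG00000199102"],
    "Gene name": ["MIR6832", "MIR4659B", "MIR1202", "MIR302C"],
    "Gene type": ["miRNA"] * 4,
    "miRBase accession": ["MI0022677", "MI0017291", "MI0006334", "MI0000773"],
    "miRBase ID": ["hsa-mir-6832", "hsa-mir-4659b",
                   "hsa-mir-1202", "hsa-mir-302c"],
})

result = (
    df
    .drop("Gene type", axis=1)
    .rename(columns={
        "Gene stable ID": "ENSG",
        "Gene name": "gene_name",
        "miRBase accession": "MI",
        "miRBase ID": "mirna_name",
    })
    # The lambda receives the intermediate (already renamed) DataFrame,
    # so "mirna_name" can be referenced without a temporary variable.
    .assign(species=lambda d: d["mirna_name"].str[:3])
)

print(result)
```

Wrapping the chain in parentheses also avoids the trailing backslashes; whether that reads better is a matter of taste.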
python pandas
Gregor sturm