I would use pandas' own string extraction methods:
    df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)')

This returns one column per capture group, with NaN wherever that group did not match:

          0    1    2    3
    0   NaN  NaN  NaN  NaN
    1   NaN  NaN  NaN  NaN
    2   NaN  NaN  NaN  NaN
    3   NaN  NaN  NaN  NaN
    4   NaN    G  NaN  NaN
    5   NaN  NaN  NaN  NaN
    6   NaN    G  NaN  NaN
    7   NaN  NaN  NaN  NaN
    8   NaN  NaN  NaN  NaN
    9   NaN  NaN  NaN  NaN
    10  NaN  NaN  NaN  NaN
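To try the extraction step on its own, here is a minimal sketch. It assumes the frame is built by hand from the SAMPLE and MUT values printed in this answer; the names `pattern` and `groups` are only illustrative.

```python
import pandas as pd

# Rows copied from the SAMPLE / MUT values printed in this answer.
df = pd.DataFrame({
    "SAMPLE": [11, 2, 9, 1, 8, 2, 1, 3, 10, 4, 5],
    "MUT": [
        "chr1:100154376:G:A",
        "chr1:100177723:C:T",
        "chr1:100177723:C:T",
        "chr1:100194200:-:AA",
        "chr1:10032249:A:G",
        "chr1:100340787:G:A",
        "chr1:100349757:A:G",
        "chr1:10041186:C:A",
        "chr1:100476986:G:C",
        "chr1:100572459:C:T",
        "chr1:100572459:C:T",
    ],
})

# One capture group per alternative; each group becomes one output column.
pattern = r"A:(T)|A:(G)|A:(C)|A:(-)"

# expand=True makes the result a DataFrame with columns 0..3.
groups = df["MUT"].str.extract(pattern, expand=True)
print(groups)
```

Only the alternative that actually matched gets a value in a given row, so every row has at most one non-NaN cell.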
Then I would convert this to True or False with pd.isnull and invert it with ~. This gives True where there is a match and False where there is not.

    ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))

            0      1      2      3
    0   False  False  False  False
    1   False  False  False  False
    2   False  False  False  False
    3   False  False  False  False
    4   False   True  False  False
    5   False  False  False  False
    6   False   True  False  False
    7   False  False  False  False
    8   False  False  False  False
    9   False  False  False  False
    10  False  False  False  False
Then assign it to the data frame:

    df[["T","G","C","-"]] = ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))

        SAMPLE                  MUT      T      G      C      -
    0       11   chr1:100154376:G:A  False  False  False  False
    1        2   chr1:100177723:C:T  False  False  False  False
    2        9   chr1:100177723:C:T  False  False  False  False
    3        1  chr1:100194200:-:AA  False  False  False  False
    4        8    chr1:10032249:A:G  False   True  False  False
    5        2   chr1:100340787:G:A  False  False  False  False
    6        1   chr1:100349757:A:G  False   True  False  False
    7        3    chr1:10041186:C:A  False  False  False  False
    8       10   chr1:100476986:G:C  False  False  False  False
    9        4   chr1:100572459:C:T  False  False  False  False
    10       5   chr1:100572459:C:T  False  False  False  False
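A variant of the same step, in case you prefer to name the columns explicitly instead of relying on the positional pairing of 0..3 with T, G, C, -. This is only a sketch on a three-row toy frame; `labels` and `flags` are names I made up, and `.notna()` is used as the equivalent of `~pd.isnull(...)`.

```python
import pandas as pd

# Three rows taken from the example data, enough to show one match.
df = pd.DataFrame({
    "SAMPLE": [1, 8, 2],
    "MUT": ["chr1:100194200:-:AA", "chr1:10032249:A:G", "chr1:100340787:G:A"],
})

labels = ["T", "G", "C", "-"]  # one label per capture group, in pattern order

# notna() is True where a group matched, i.e. the same as ~pd.isnull(...).
flags = df["MUT"].str.extract(r"A:(T)|A:(G)|A:(C)|A:(-)", expand=True).notna()
flags.columns = labels  # rename columns 0..3 to the base each group captures

# join aligns on the index, so each flag row stays with its MUT row.
df = df.join(flags)
print(df)
```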
Now we can just sum the columns:
    df[["T","G","C","-"]].sum()

    T    0
    G    2
    C    0
    -    0
But wait, we didn’t restrict this to the rows where SAMPLE == 1.
We can do this very easily with a mask:
    sample_one_mask = df.SAMPLE == 1
    df[sample_one_mask][["T","G","C","-"]].sum()

    T    0
    G    1
    C    0
    -    0
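The same mask can also be applied with `.loc`, which selects the rows and the columns in one step. A small sketch, assuming the flag columns already exist; the toy values below are illustrative and not the full data set.

```python
import pandas as pd

# Toy frame with the flag columns already in place (illustrative values).
df = pd.DataFrame({
    "SAMPLE": [1, 8, 1, 2],
    "T": [False, False, False, False],
    "G": [False, True, True, False],
    "C": [False, False, False, False],
    "-": [False, False, False, False],
})

sample_one_mask = df["SAMPLE"] == 1

# Rows where the mask is True, restricted to the four flag columns.
# Summing booleans counts the True values per column.
counts = df.loc[sample_one_mask, ["T", "G", "C", "-"]].sum()
print(counts)
```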
If you want these counts per SAMPLE, you can use the groupby function:

    df[["SAMPLE","T","G","C","-"]].groupby("SAMPLE").agg(sum).astype(int)

            T  G  C  -
    SAMPLE
    1       0  1  0  0
    2       0  0  0  0
    3       0  0  0  0
    4       0  0  0  0
    5       0  0  0  0
    8       0  1  0  0
    9       0  0  0  0
    10      0  0  0  0
    11      0  0  0  0
TL;DR

Do this:

    df[["T","G","C","-"]] = ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
    df[["SAMPLE","T","G","C","-"]].groupby("SAMPLE").agg(sum).astype(int)
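If you need this for more than one reference base, the two lines could be wrapped in a function along these lines. This is a sketch, not a tested pipeline: `count_alts_per_sample` and its `ref` and `alts` parameters are hypothetical names of mine, and it keeps the same unanchored pattern as above, so it assumes MUT always looks like chr:pos:REF:ALT with a single-base REF.

```python
import re

import pandas as pd


def count_alts_per_sample(df, ref="A", alts=("T", "G", "C", "-")):
    """Count ref->alt changes per SAMPLE (hypothetical helper)."""
    # Build 'A:(T)|A:(G)|A:(C)|A:(\-)': one capture group per alternative.
    pattern = "|".join(f"{re.escape(ref)}:({re.escape(a)})" for a in alts)
    # True where the corresponding group matched.
    flags = df["MUT"].str.extract(pattern, expand=True).notna()
    flags.columns = list(alts)
    # Group the flags by the SAMPLE column and count the True values.
    return flags.groupby(df["SAMPLE"]).sum().astype(int)


# Usage on four rows taken from the example data.
df = pd.DataFrame({
    "SAMPLE": [1, 8, 1, 2],
    "MUT": [
        "chr1:100194200:-:AA",
        "chr1:10032249:A:G",
        "chr1:100349757:A:G",
        "chr1:100340787:G:A",
    ],
})
result = count_alts_per_sample(df)
print(result)
```

Unlike the two-liner, this leaves `df` itself unchanged and returns only the per-sample counts.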