Count strings matching strings and numeric with pandas

Question


I have a SAMPLE column with values from 1 to 12, and for each sample I want to count the number of mutations of each type (A:T, C:G, etc.). The code below works, but how can I change it to give me the counts for all 12 samples and for every mutation type, instead of writing the same code 12 times?

In this example, AT gives me the count for SAMPLE=1. I want the AT count for each sample number (1, 2, ..., 12). How can I change the code to do this? I would be grateful for any help. Thanks.

        SAMPLE                  MUT
    0       11   chr1:100154376:G:A
    1        2   chr1:100177723:C:T
    2        9   chr1:100177723:C:T
    3        1  chr1:100194200:-:AA
    4        8    chr1:10032249:A:G
    5        2   chr1:100340787:G:A
    6        1   chr1:100349757:A:G
    7        3    chr1:10041186:C:A
    8       10   chr1:100476986:G:C
    9        4   chr1:100572459:C:T
    10       5   chr1:100572459:C:T
    ...    ...                  ...

    # Select both columns (a list of names needs double brackets)
    d = df[["SAMPLE", "MUT"]]
    chars1 = "TGC-"
    number = {}
    for item in chars1:
        # Rows whose mutation is A:<item> and whose sample is 1
        dm = d[(d["MUT"].str.contains("A:" + item)) & (d["SAMPLE"].isin([1]))]
        num1 = dm.count()
        number[item] = num1
    AT = number["T"]
    AG = number["G"]
    AC = number["C"]
    A_ = number["-"]


python pandas

kant Aug 11 '15 at 21:21


3 answers

You can create a column holding the mutation type (A->T, G->C) with a regular-expression replacement, and then use pandas groupby to count.

    import pandas as pd
    import re

    df = pd.read_table('df.tsv')
    df['mutation_type'] = df['MUT'].apply(lambda x: re.sub(r'^.*?:([^:]+:[^:]+)$', r'\1', x))
    df.groupby(['SAMPLE','mutation_type']).agg('count')['MUT']

For your data, the result is:

    SAMPLE  mutation_type
    1       -:AA             1
            A:G              1
    2       C:T              1
            G:A              1
    3       C:A              1
    4       C:T              1
    5       C:T              1
    8       A:G              1
    9       C:T              1
    10      G:C              1
    11      G:A              1
    Name: MUT, dtype: int64
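To see what the regular expression in this answer does on its own, here is a minimal sketch that applies it to two of the MUT strings from the question. The helper name `mutation_type` and the constant `PATTERN` are hypothetical, introduced only for illustration.

```python
import re

# Same pattern as in the answer: drop everything up to the last two
# colon-separated fields and keep "ref:alt".
PATTERN = r'^.*?:([^:]+:[^:]+)$'


def mutation_type(mut):
    """Return the trailing "ref:alt" part of a MUT string (hypothetical helper)."""
    return re.sub(PATTERN, r'\1', mut)


for mut in ["chr1:100154376:G:A", "chr1:100194200:-:AA"]:
    print(mut, "->", mutation_type(mut))
```

The two strings reduce to `G:A` and `-:AA`, which are the labels that appear in the groupby output.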


AP Aug 11 '15 at 10:08


My answer is similar to AP's:

    import pandas as pd

    df = pd.DataFrame(data={'SAMPLE': [11,2,9,1,8,2,1,3,10,4,5],
                            'MUT': ['chr1:100154376:G:A', 'chr1:100177723:C:T',
                                    'chr1:100177723:C:T', 'chr1:100194200:-:AA',
                                    'chr1:10032249:A:G', 'chr1:100340787:G:A',
                                    'chr1:100349757:A:G', 'chr1:10041186:C:A',
                                    'chr1:100476986:G:C', 'chr1:100572459:C:T',
                                    'chr1:100572459:C:T']},
                      columns=['SAMPLE', 'MUT'])
    # Strip the leading "chr:position:" prefix, leaving "ref:alt"
    df['Sequence'] = df['MUT'].str.replace(r'^\w+:\d+:', '', regex=True)
    df.groupby(['SAMPLE', 'Sequence']).count()

This gives:

                     MUT
    SAMPLE Sequence
    1      -:AA        1
           A:G         1
    2      C:T         1
           G:A         1
    3      C:A         1
    4      C:T         1
    5      C:T         1
    8      A:G         1
    9      C:T         1
    10     G:C         1
    11     G:A         1


Jarad Aug 11 '15 at 10:19


firelynx · Accepted Answer · 2015-08-12T08:56:24+0000

I would use pandas' own string extraction methods:

 df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)')

This returns the matches for the different groups:

          0    1    2    3
    0   NaN  NaN  NaN  NaN
    1   NaN  NaN  NaN  NaN
    2   NaN  NaN  NaN  NaN
    3   NaN  NaN  NaN  NaN
    4   NaN    G  NaN  NaN
    5   NaN  NaN  NaN  NaN
    6   NaN    G  NaN  NaN
    7   NaN  NaN  NaN  NaN
    8   NaN  NaN  NaN  NaN
    9   NaN  NaN  NaN  NaN
    10  NaN  NaN  NaN  NaN

Then I would convert this to True/False with pd.isnull and invert it with ~. This gives True where there is a match and False where there is not.

    ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))

            0      1      2      3
    0   False  False  False  False
    1   False  False  False  False
    2   False  False  False  False
    3   False  False  False  False
    4   False   True  False  False
    5   False  False  False  False
    6   False   True  False  False
    7   False  False  False  False
    8   False  False  False  False
    9   False  False  False  False
    10  False  False  False  False
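As a side note, the inverted `pd.isnull` call can also be written with `DataFrame.notna()`, which reads a little more directly. The sketch below checks that the two forms agree on a two-row example taken from the question's data; the variable names `mut`, `groups`, `via_isnull` and `via_notna` are made up for illustration.

```python
import pandas as pd

# Two MUT values from the question: one A:G mutation, one G:A mutation.
mut = pd.Series(["chr1:10032249:A:G", "chr1:100154376:G:A"])

# One capture group per alternative, so the result has four columns (0..3).
groups = mut.str.extract(r"A:(T)|A:(G)|A:(C)|A:(-)")

via_isnull = ~pd.isnull(groups)   # the form used in this answer
via_notna = groups.notna()        # equivalent spelling

print(via_notna)
print(via_isnull.equals(via_notna))
```

Only the first row matches (in the column for the `A:(G)` group); the second row contains no `A:` followed by one of the four characters, so it stays all False.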

Then assign it to the DataFrame:

    df[["T","G","C","-"]] = ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))

        SAMPLE                  MUT      T      G      C      -
    0       11   chr1:100154376:G:A  False  False  False  False
    1        2   chr1:100177723:C:T  False  False  False  False
    2        9   chr1:100177723:C:T  False  False  False  False
    3        1  chr1:100194200:-:AA  False  False  False  False
    4        8    chr1:10032249:A:G  False   True  False  False
    5        2   chr1:100340787:G:A  False  False  False  False
    6        1   chr1:100349757:A:G  False   True  False  False
    7        3    chr1:10041186:C:A  False  False  False  False
    8       10   chr1:100476986:G:C  False  False  False  False
    9        4   chr1:100572459:C:T  False  False  False  False
    10       5   chr1:100572459:C:T  False  False  False  False

Now we can just sum the columns:

    df[["T","G","C","-"]].sum()

    T    0
    G    2
    C    0
    -    0

But wait, this count is not restricted to the rows where SAMPLE == 1.

We can do this very easily with a mask:

    sample_one_mask = df.SAMPLE == 1
    df[sample_one_mask][["T","G","C","-"]].sum()

    T    0
    G    1
    C    0
    -    0

If you want these counts for each SAMPLE value, you can use the groupby function:

    df[["SAMPLE","T","G","C","-"]].groupby("SAMPLE").agg(sum).astype(int)

            T  G  C  -
    SAMPLE
    1       0  1  0  0
    2       0  0  0  0
    3       0  0  0  0
    4       0  0  0  0
    5       0  0  0  0
    8       0  1  0  0
    9       0  0  0  0
    10      0  0  0  0
    11      0  0  0  0

TL;DR

Do this:

    df[["T","G","C","-"]] = ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
    df[["SAMPLE","T","G","C","-"]].groupby("SAMPLE").agg(sum).astype(int)
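The question also asks for every mutation type across all 12 samples, not only the A:x mutations. One possible way to get that is a cross-tabulation. This swaps in `pd.crosstab` and is not the method used in this answer. The sketch rebuilds the sample data from the question inline, and the names `mutation` and `counts` are hypothetical.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "SAMPLE": [11, 2, 9, 1, 8, 2, 1, 3, 10, 4, 5],
    "MUT": [
        "chr1:100154376:G:A", "chr1:100177723:C:T", "chr1:100177723:C:T",
        "chr1:100194200:-:AA", "chr1:10032249:A:G", "chr1:100340787:G:A",
        "chr1:100349757:A:G", "chr1:10041186:C:A", "chr1:100476986:G:C",
        "chr1:100572459:C:T", "chr1:100572459:C:T",
    ],
})

# Keep only the trailing "ref:alt" part of each MUT string.
mutation = df["MUT"].str.extract(r"([^:]+:[^:]+)$", expand=False).rename("mutation")

# Rows are samples, columns are mutation types, cells are counts.
counts = pd.crosstab(df["SAMPLE"], mutation)

# Assumption: samples are numbered 1..12, so add all-zero rows for the
# samples that do not occur in this excerpt of the data.
counts = counts.reindex(index=range(1, 13), fill_value=0)

print(counts)
```

With this layout, a single value such as the A:G count for sample 1 can be read with `counts.loc[1, "A:G"]`.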
