Counting records in big data with pandas

I am working with a tab-separated file:

A    B    C    D
a    d    ii   do 
a    d    g    do
a    h    g    do
a    i    k    mo
c    i    k    mo
c    g    ii   mo
v    g    p    do

For each entry in the first column, I want to count how often it occurs and how often each value associated with it occurs in the second, third and fourth columns, for example:

a 4 d 2 h 1 i 1 ii 1 g 2 k 1 domain 3 motif 1
c 2 i 1 g 1 k 1 ii 1 motif 2
v 1 g 1 p 1 domain 1

I am trying to do this with Python pandas using the following commands:

df = pd.read_csv('file.txt', delimiter= '\t', names = ['A', 'B', 'C', 'D']) 
df1.groupby(['a', 'c', 'd', 'e']).count()

but it does not return the desired results.

2 answers
import pandas as pd
df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'c', 'c', 'v'],
                   'B': ['d', 'd', 'h', 'i', 'i', 'g', 'g'],
                   'C': ['ii', 'g', 'g', 'k', 'k', 'ii', 'p'],
                   'D': ['domain', 'domain', 'domain', 'motif', 
                         'motif', 'motif', 'domain']})

melted = pd.melt(df, id_vars='A')
count = melted.groupby(['A', 'value'])['value'].count()
result = count.unstack(fill_value=0)
result['A'] = df.groupby('A')['A'].count()
print(result)

gives

value  d  domain  g  h  i  ii  k  motif  p  A
A                                            
a      2       3  2  1  1   1  1      1  0  4
c      0       0  1  0  1   1  1      2  0  2
v      0       1  1  0  0   0  0      0  1  1

Explanation

  • Use pd.melt to combine all columns except column A into a single column:

    In [517]: melted = pd.melt(df, id_vars='A'); melted
    Out[517]: 
        A variable   value
    0   a        B       d
    1   a        B       d
    2   a        B       h
    3   a        B       i
    4   c        B       i
    ...
    
  • Then you can group by columns A and value and count the rows in each group:

    In [520]: count = melted.groupby(['A', 'value'])['value'].count(); count
    Out[520]: 
    A  value 
    a  d         2
       domain    3
       g         2
       h         1
    ...
    
  • count.unstack('value') moves the index level value into the column index:

    In [522]: count.unstack('value', fill_value=0)
    Out[522]: 
    value  d  domain  g  h  i  ii  k  motif  p
    A                                         
    a      2       3  2  1  1   1  1      1  0
    c      0       0  1  0  1   1  1      2  0
    v      0       1  1  0  0   0  0      0  1
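
To get from this table to the one-line-per-key layout the question asks for, the nonzero counts in each row can be joined into a string. This is only a sketch, assuming the file has a header row and spells out domain and motif in full. The names raw, group_sizes and format_row are hypothetical, and the inline string stands in for file.txt.

```python
import io
import pandas as pd

# Hypothetical stand-in for the tab-separated file, header row included.
raw = (
    "A\tB\tC\tD\n"
    "a\td\tii\tdomain\n"
    "a\td\tg\tdomain\n"
    "a\th\tg\tdomain\n"
    "a\ti\tk\tmotif\n"
    "c\ti\tk\tmotif\n"
    "c\tg\tii\tmotif\n"
    "v\tg\tp\tdomain\n"
)

# The first row already holds the column names, so names= is not passed here.
df = pd.read_csv(io.StringIO(raw), sep="\t")

# Same steps as in the answer above: melt, group, count, unstack.
melted = pd.melt(df, id_vars="A")
count = melted.groupby(["A", "value"])["value"].count()
result = count.unstack(fill_value=0)

# Number of rows per key in column A.
group_sizes = df.groupby("A")["A"].count()


def format_row(key):
    """Return 'key size value count ...', skipping values that never occur."""
    parts = [str(key), str(group_sizes[key])]
    for value, n in result.loc[key].items():
        if n > 0:
            parts.extend([str(value), str(n)])
    return " ".join(parts)


lines = [format_row(key) for key in result.index]
print("\n".join(lines))
```

The values appear in the sorted column order of result rather than in order of first appearance, so the line for a reads a 4 d 2 domain 3 g 2 h 1 i 1 ii 1 k 1 motif 1.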
    
import pandas as pd
df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'c', 'c', 'v'],
                   'B': ['d', 'd', 'h', 'i', 'i', 'g', 'g'],
                   'C': ['ii', 'g', 'g', 'k', 'k', 'ii', 'p'],
                   'D': ['domain', 'domain', 'domain', 'motif', 
                         'motif', 'motif', 'domain']})

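n = [name for name, g in df.groupby('A')]  # remember the group (index) names
# Per group: the name repeated once per row, followed by all B, C and D values
d = [[name] * g['A'].count() + g[['B', 'C', 'D']].values.flatten().tolist()
     for name, g in df.groupby('A')]
# Count each value per group; values absent from a group become NaN, so fill with 0
rslt = pd.DataFrame([dict((x, r.count(x)) for x in r) for r in d]).fillna(0).astype(int)

rslt['count'] = rslt[n].sum(axis=1)        # group size = occurrences of the group's own name
rslt.set_index(pd.Index(n), inplace=True)  # label the rows with the group names
rslt.drop(n, axis=1, inplace=True)         # drop the helper columns 'a', 'c', 'v'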

Explanation:

  1. Flatten each group into a single list:
d
Out[138]: 
[['a',
  'a',
  'a',
  'a',
  'd',
  'ii',
  'domain',
  'd',
  'g',
  'domain',
  'h',
  'g',
  'domain',
  'i',
  'k',
  'motif'],
 ['c', 'c', 'i', 'k', 'motif', 'g', 'ii', 'motif'],
 ['v', 'g', 'p', 'domain']]
  2. Build a DataFrame using Python's built-in list count method and a generator expression, replace NaN with 0 and cast back to integers:
pd.DataFrame([dict((x,r.count(x)) for x in r) for r in d]).fillna(0).astype(int)
Out[141]:
   a  c  d  domain  g  h  i  ii  k  motif  p  v
0  4  0  2       3  2  1  1   1  1      1  0  0
1  0  2  0       0  1  0  1   1  1      2  0  0
2  0  0  0       1  1  0  0   0  0      0  1  1
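  3. Add the count column, set the index to the group names and drop the helper columns to get the final DataFrame: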
rslt
Out[143]: 
   d  domain  g  h  i  ii  k  motif  p  count
a  2       3  2  1  1   1  1      1  0      4
c  0       0  1  0  1   1  1      2  0      2
v  0       1  1  0  0   0  0      0  1      1
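
The same idea can also be sketched with collections.Counter from the standard library, which does the per-group tallying in one call. This is a variation on the answer above, not the code from it. The name counters is hypothetical, and the group sizes come from groupby(...).size() instead of the repeated-name trick.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'c', 'c', 'v'],
                   'B': ['d', 'd', 'h', 'i', 'i', 'g', 'g'],
                   'C': ['ii', 'g', 'g', 'k', 'k', 'ii', 'p'],
                   'D': ['domain', 'domain', 'domain', 'motif',
                         'motif', 'motif', 'domain']})

# One Counter per group: tally every value found in columns B, C and D.
counters = {
    name: Counter(g[['B', 'C', 'D']].to_numpy().ravel().tolist())
    for name, g in df.groupby('A')
}

# One row per group; values missing from a group come out as NaN,
# so fill them with 0 and cast back to integers.
names = list(counters)
rslt = pd.DataFrame([counters[name] for name in names], index=names)
rslt = rslt.fillna(0).astype(int).sort_index(axis=1)

# Number of rows per group, aligned on the group names.
rslt['count'] = df.groupby('A').size()

print(rslt)
```

Because the group name is never mixed into the flattened list, this variant should stay correct even if a key from column A also appears as a value in columns B, C or D.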