How to get the frequency of each unique letter pair for every possible pair of columns in a numpy matrix in Python

I have a matrix like this, stored as a numpy array:

    >>> print(matrix)
    [['L' 'G' 'T' 'G' 'A' 'P' 'V' 'I']
     ['A' 'A' 'S' 'G' 'P' 'S' 'S' 'G']
     ['A' 'A' 'S' 'G' 'P' 'S' 'S' 'G']
     ['G' 'L' 'T' 'G' 'A' 'P' 'V' 'I']]

For EVERY POSSIBLE pair of columns, I would like to extract the frequency of each unique pair of letters, where a pair is formed by the two letters that one row has in those two columns.

For example, for the first pair of columns, that is:

    [['L' 'G']
     ['A' 'A']
     ['A' 'A']
     ['G' 'L']]

I would like to get the frequency of each pair of letters within those two columns (NOTE: the order of the letters matters):

Frequency ['L' 'G'] = 1/4

Frequency ['A' 'A'] = 2/4

Frequency ['G' 'L'] = 1/4

Once the frequencies for the first pair of columns are calculated, I want to do the same for every other possible combination of columns.

I think itertools could help solve this, but I don't know how. Any help would be greatly appreciated.

2 answers

I would use itertools.combinations and collections.Counter:

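    import collections
    import itertools

    # matrix is the 2-D numpy array of letters from the question
    for i, j in itertools.combinations(range(matrix.shape[1]), 2):
        c = matrix[:, [i, j]]                        # the two columns, shape (rows, 2)
        counts = collections.Counter(map(tuple, c))  # count each ordered letter pair
        print('columns {} and {}'.format(i, j))
        for k in sorted(counts):
            print('Frequency of {} = {}/{}'.format(k, counts[k], len(c)))
        print('')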

produces

    columns 0 and 1
    Frequency of ('A', 'A') = 2/4
    Frequency of ('G', 'L') = 1/4
    Frequency of ('L', 'G') = 1/4

    columns 0 and 2
    Frequency of ('A', 'S') = 2/4
    Frequency of ('G', 'T') = 1/4
    Frequency of ('L', 'T') = 1/4

    [...]

(Modifying it to handle both columns 0, 1 and columns 1, 0, if you want both orders, is trivial. I also assumed that "every possible pair of columns" does not mean "every adjacent pair of columns".)
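
As a rough illustration of that modification, here is a minimal sketch that collects the frequencies in a dictionary instead of printing them, and switches to itertools.permutations when both column orders are wanted. The function name pair_frequencies and the both_orders flag are made up for this example, and the code assumes Python 3.

```python
import itertools
from collections import Counter
from fractions import Fraction

import numpy as np


def pair_frequencies(matrix, both_orders=False):
    """Map each column pair (i, j) to {(letter_i, letter_j): frequency}.

    Hypothetical helper for illustration, not part of numpy or itertools.
    """
    rows = np.asarray(matrix).tolist()  # plain Python strings, one list per row
    n_cols = len(rows[0])
    # combinations yields only i < j; permutations also yields (j, i)
    pick = itertools.permutations if both_orders else itertools.combinations
    result = {}
    for i, j in pick(range(n_cols), 2):
        counts = Counter((row[i], row[j]) for row in rows)
        result[(i, j)] = {pair: Fraction(n, len(rows))
                          for pair, n in counts.items()}
    return result


matrix = np.array([list('LGTGAPVI'),
                   list('AASGPSSG'),
                   list('AASGPSSG'),
                   list('GLTGAPVI')])

freqs = pair_frequencies(matrix)
for pair, freq in sorted(freqs[(0, 1)].items()):
    print(pair, freq)
```

Note that Fraction reduces its value, so the 2/4 above is reported as 1/2. Calling pair_frequencies(matrix, both_orders=True) also fills in keys such as (1, 0).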


If you have memory to spare, then for some array sizes (I am guessing few columns and many rows) it may pay off to use a more numpy-intensive solution:

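    >>> rows, cols = matrix.shape
    >>> matches = np.empty((rows, cols, cols, 2), dtype='S1')  # one byte per letter
    >>> matches[..., 0] = matrix[:, :, None]  # first letter comes from column i
    >>> matches[..., 1] = matrix[:, None, :]  # second letter comes from column j
    >>> matches = matches.view('S2')          # fuse each letter pair into one 2-byte string
    >>> matches = matches.reshape((rows, cols, cols))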

Now matches[:, i, j] holds the letter pairs formed from columns i and j, one per row, and you can do the following:

    >>> unique, idx = np.unique(matches[:, 0, 1], return_inverse=True)
    >>> counts = np.bincount(idx)
    >>> unique
    array(['AA', 'GL', 'LG'], dtype='|S2')
    >>> counts
    array([2, 1, 1])
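
For what it is worth, a similar result can probably be reached without the view trick. The sketch below is a variation, not the code above: it assumes Python 3 and a NumPy version where np.unique accepts return_counts, builds the pairs with np.char.add and broadcasting, and then loops over the column pairs to turn counts into frequencies. The names pairs and frequencies are chosen just for this example.

```python
import itertools

import numpy as np

matrix = np.array([list('LGTGAPVI'),
                   list('AASGPSSG'),
                   list('AASGPSSG'),
                   list('GLTGAPVI')])
rows, cols = matrix.shape

# Broadcasting (rows, cols, 1) against (rows, 1, cols) gives (rows, cols, cols):
# pairs[r, i, j] is the letter in column i followed by the letter in column j.
pairs = np.char.add(matrix[:, :, None], matrix[:, None, :])

frequencies = {}
for i, j in itertools.combinations(range(cols), 2):
    # unique is sorted; counts[k] is how often unique[k] occurs in this column pair
    unique, counts = np.unique(pairs[:, i, j], return_counts=True)
    frequencies[(i, j)] = dict(zip(unique.tolist(), (counts / rows).tolist()))

print(frequencies[(0, 1)])  # {'AA': 0.5, 'GL': 0.25, 'LG': 0.25}
```

The memory cost is the same in spirit as above, since the intermediate array still holds rows * cols * cols pairs.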