How to get frequencies from the number of unique occurrences of paired letters for each possible pair of columns from a numpy matrix in python

Question


I have a matrix like this, built with numpy:

    >>> print matrix
    [['L' 'G' 'T' 'G' 'A' 'P' 'V' 'I']
     ['A' 'A' 'S' 'G' 'P' 'S' 'S' 'G']
     ['A' 'A' 'S' 'G' 'P' 'S' 'S' 'G']
     ['G' 'L' 'T' 'G' 'A' 'P' 'V' 'I']]

For EVERY POSSIBLE pair of columns, I would like to extract the frequency of each unique pair of letters, where a pair is formed by the two letters a row has in those two columns.

For example, for the first pair of columns (columns 0 and 1), that is:

    [['L' 'G']
     ['A' 'A']
     ['A' 'A']
     ['G' 'L']]

I would like to get the frequency of each pair of letters within that pair of columns (NOTE: the order of the letters matters):

Frequency ['L' 'G'] = 1/4
Frequency ['A' 'A'] = 2/4
Frequency ['G' 'L'] = 1/4

Once the frequencies for the first pair of columns have been calculated, I want to do the same for every other possible combination of columns.

I think something from itertools could help solve this, but I don't know how. Any help would be greatly appreciated.


python numpy itertools

Àngel Ba Mar 03 '13 at 15:44


2 answers

If you have memory to spare, then for some array sizes (I am guessing few columns and many rows) a more memory-intensive solution may pay off:

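    >>> rows, cols = matrix.shape
    >>> # 'S1' (one byte per letter) so that two letters can be viewed as one 'S2' string
    >>> matches = np.empty((rows, cols, cols, 2), dtype='S1')
    >>> matches[..., 0] = matrix[:, :, None]   # first letter: column i
    >>> matches[..., 1] = matrix[:, None, :]   # second letter: column j
    >>> matches = matches.view('S2')
    >>> matches = matches.reshape((rows, cols, cols))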

Now matches[:, i, j] holds the letter pairs (one per row, not yet deduplicated) for columns i and j, and you can count them as follows:

    >>> unique, idx = np.unique(matches[:, 0, 1], return_inverse=True)
    >>> counts = np.bincount(idx)
    >>> unique
    array(['AA', 'GL', 'LG'], dtype='|S2')
    >>> counts
    array([2, 1, 1])
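The counting step above handles a single pair of columns. To cover every pair, it can be wrapped in a loop. The following is only a minimal sketch, assuming Python 3 and a NumPy version in which np.unique accepts return_counts. The helper name pair_frequencies is hypothetical, and it joins the two columns with np.char.add rather than the view trick, which uses far less memory at the cost of one small array per pair.

```python
import itertools

import numpy as np

# The matrix from the question, rebuilt here so the example is self-contained.
matrix = np.array([
    list("LGTGAPVI"),
    list("AASGPSSG"),
    list("AASGPSSG"),
    list("GLTGAPVI"),
])


def pair_frequencies(matrix, i, j):
    """Return {pair: frequency} for column i followed by column j (hypothetical helper)."""
    # Join the letters of the two columns row by row, e.g. 'L' + 'G' -> 'LG'.
    pairs = np.char.add(matrix[:, i], matrix[:, j])
    # Count how often each distinct two-letter string occurs.
    unique, counts = np.unique(pairs, return_counts=True)
    n_rows = matrix.shape[0]
    return {str(u): int(c) / n_rows for u, c in zip(unique, counts)}


# Visit every unordered pair of columns (i < j).
for i, j in itertools.combinations(range(matrix.shape[1]), 2):
    print(i, j, pair_frequencies(matrix, i, j))
```

The frequencies are returned as floats here; keeping the raw counts alongside the row count works just as well if exact fractions are preferred.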


Jaime Mar 03 '13 at 21:07


DSM · Accepted Answer · 2013-03-03T16:00:02+0000

I would use itertools.combinations and collections.Counter :

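    import collections
    import itertools

    # s is the matrix from the question
    for i, j in itertools.combinations(range(len(s.T)), 2):
        c = s[:, [i, j]]                             # the two selected columns
        counts = collections.Counter(map(tuple, c))  # count each (letter, letter) row
        print('columns {} and {}'.format(i, j))
        for k in sorted(counts):
            print('Frequency of {} = {}/{}'.format(k, counts[k], len(c)))
        print('')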

produces

    columns 0 and 1
    Frequency of ('A', 'A') = 2/4
    Frequency of ('G', 'L') = 1/4
    Frequency of ('L', 'G') = 1/4

    columns 0 and 2
    Frequency of ('A', 'S') = 2/4
    Frequency of ('G', 'T') = 1/4
    Frequency of ('L', 'T') = 1/4

    [...]

(Modifying it to handle both columns 0 1 and 1 0, if you want both orders, is trivial. I have also assumed that "every possible pair of columns" does not mean "every adjacent pair of columns".)
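One way to get both column orders is to swap itertools.combinations for itertools.permutations, which yields (0, 1) as well as (1, 0). This is a rough sketch rather than part of the original answer, assuming Python 3. The function name ordered_pair_frequencies is hypothetical, and it returns the frequencies as Fraction objects in a dictionary instead of printing them.

```python
import collections
import itertools
from fractions import Fraction

import numpy as np

# The matrix from the question, named s as in the answer above.
s = np.array([
    list("LGTGAPVI"),
    list("AASGPSSG"),
    list("AASGPSSG"),
    list("GLTGAPVI"),
])


def ordered_pair_frequencies(s):
    """Map each ordered column pair (i, j) to {(letter_i, letter_j): Fraction}."""
    n_rows, n_cols = s.shape
    result = {}
    # permutations yields both (i, j) and (j, i), unlike combinations.
    for i, j in itertools.permutations(range(n_cols), 2):
        # tolist() converts NumPy strings to plain Python strings.
        counts = collections.Counter(zip(s[:, i].tolist(), s[:, j].tolist()))
        result[(i, j)] = {pair: Fraction(n, n_rows) for pair, n in counts.items()}
    return result


freqs = ordered_pair_frequencies(s)
print(freqs[(0, 1)])
print(freqs[(1, 0)])
```

Note that Fraction reduces automatically, so 2/4 is stored as 1/2. To display the unreduced form used in the printed output above, keep the Counter values and the row count separately.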
