Waterman-Eggert Implementation

Question

Waterman-Eggert Implementation

I'm trying to implement the Waterman-Eggert algorithm to find suboptimal alignments of local sequences, but I'm trying to figure out how to “skip” individual alignment groups. I have a standard Smith-Waterman algorithm.

A simple test aligning the following sequence:

'HEAGHEAGHEAG' 'HEAGHEAGHEAG'

creates fMatrix as follows:

  [[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [ 0. 8. 0. 0. 0. 8. 0. 0. 0. 8. 0. 0. 0.] [ 0. 0. 13. 0. 0. 0. 13. 0. 0. 0. 13. 0. 0.] [ 0. 0. 0. 17. 0. 0. 0. 17. 0. 0. 0. 17. 0.] [ 0. 0. 0. 0. 23. 0. 0. 0. 23. 0. 0. 0. 23.] [ 0. 8. 0. 0. 0. 31. 0. 0. 0. 31. 0. 0. 0.] [ 0. 0. 13. 0. 0. 0. 36. 0. 0. 0. 36. 0. 0.] [ 0. 0. 0. 17. 0. 0. 0. 40. 0. 0. 0. 40. 0.] [ 0. 0. 0. 0. 23. 0. 0. 0. 46. 0. 0. 0. 46.] [ 0. 8. 0. 0. 0. 31. 0. 0. 0. 54. 4. 0. 0.] [ 0. 0. 13. 0. 0. 0. 36. 0. 0. 4. 59. 9. 0.] [ 0. 0. 0. 17. 0. 0. 0. 40. 0. 0. 9. 63. 13.] [ 0. 0. 0. 0. 23. 0. 0. 0. 46. 0. 0. 13. 69.]]

To find suboptimal alignments like

 'HEAGHEAGHEAG ' ' HEAGHEAGHEAG'

You must first remove the optimal alignment (that is, along the main diagonal) and recount fMatrix; this is called "declumping", where the "clump" of alignments is defined as any alignments whose paths intersect / share one or more pairs of aligned residues. In addition to fMatrix, there is a secondary matrix containing information about the direction in which fMatrix was built.

The code snippet for building the fMatrix matrix and backtracking is as follows:

 # Generates fMatrix. for i in range(1, length): for j in range(1, length): matchScore = fMatrix[i-1][j-1] + simMatrixDict[seq[i-1]+seq[j-1]] insScore = fMatrix[i][j-1] + gap delScore = fMatrix[i-1][j] + gap fMatrix[i][j] = max(0, matchScore, insScore, delScore) # Generates matrix for backtracking. if fMatrix[i][j] == matchScore: backMatrix[i][j] = 2 elif fMatrix[i][j] == insScore: backMatrix[i][j] = 3 # INSERTION in seq - Horizontal elif fMatrix[i][j] == delScore: backMatrix[i][j] = 1 # DELETION in seq - Vertical if fMatrix[i][j] >= backtrackStart: backtrackStart = fMatrix[i][j] endCoords = i, j return fMatrix, backMatrix, endCoords

To eliminate this optimal alignment, I tried to use this backMatrix to return via fMatrix (according to the original Smith-Waterman algorithm) and set fMatrix[i][j] = 0 as I go, but this does not delete the entire cluster, only exact alignment in this cluster.

For some reference information, the Wikipedia page for the Smith-Waterman algorithm explains how fMatrix is built, and there is an explanation here about how backtracking works. The Waterman-Eggert algorithm is roughly described here .

Thanks.

+7

python bioinformatics sequence-alignment

jonwells May 17 '13 at 14:36

source share

1 answer

daviewales · Answer 1 · 2013-05-20T12:58:07+0000

OK Here is the code to do what you want. I used a pretty printed library ( pprint ) to make the result look good. (It looks prettier if the numbers in the matrix have a single digit, but the alignment is a bit confused if there are multi-digit numbers.)

How it works?

Because you only need to change the numbers on the main diagonal, as well as on the diagonals above and below, we just need one for the loop. matrix[i][i] always on the main diagonal, because these are i rows down and i are columns. matrix[i][i-1] always the bottom adjacent diagonal, because these are i rows down and i-1 columns. matrix[i-1][i] always the top adjacent diagonal, because i-1 contains rows, and i contains rows.

 #!/usr/bin/python import pprint matrix = [ [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,], [ 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0,], [ 0, 0, 13, 0, 0, 0, 13, 0, 0, 0, 13, 0, 0,], [ 0, 0, 0, 17, 0, 0, 0, 17, 0, 0, 0, 17, 0,], [ 0, 0, 0, 0, 23, 0, 0, 0, 23, 0, 0, 0, 23,], [ 0, 8, 0, 0, 0, 31, 0, 0, 0, 31, 0, 0, 0,], [ 0, 0, 13, 0, 0, 0, 36, 0, 0, 0, 36, 0, 0,], [ 0, 0, 0, 17, 0, 0, 0, 40, 0, 0, 0, 40, 0,], [ 0, 0, 0, 0, 23, 0, 0, 0, 46, 0, 0, 0, 46,], [ 0, 8, 0, 0, 0, 31, 0, 0, 0, 54, 4, 0, 0,], [ 0, 0, 13, 0, 0, 0, 36, 0, 0, 4, 59, 9, 0,], [ 0, 0, 0, 17, 0, 0, 0, 40, 0, 0, 9, 63, 13,], [ 0, 0, 0, 0, 23, 0, 0, 0, 46, 0, 0, 13, 69,]] print "Original Matrix" pprint.pprint(matrix) print for i in range(len(matrix)): matrix[i][i] = 0 if (i > 0) and (i < (len(matrix))): matrix[i][i-1] = 0 matrix[i-1][i] = 0 print "New Matrix" pprint.pprint(matrix)

Output:

 Original Matrix [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0], [0, 0, 13, 0, 0, 0, 13, 0, 0, 0, 13, 0, 0], [0, 0, 0, 17, 0, 0, 0, 17, 0, 0, 0, 17, 0], [0, 0, 0, 0, 23, 0, 0, 0, 23, 0, 0, 0, 23], [0, 8, 0, 0, 0, 31, 0, 0, 0, 31, 0, 0, 0], [0, 0, 13, 0, 0, 0, 36, 0, 0, 0, 36, 0, 0], [0, 0, 0, 17, 0, 0, 0, 40, 0, 0, 0, 40, 0], [0, 0, 0, 0, 23, 0, 0, 0, 46, 0, 0, 0, 46], [0, 8, 0, 0, 0, 31, 0, 0, 0, 54, 4, 0, 0], [0, 0, 13, 0, 0, 0, 36, 0, 0, 4, 59, 9, 0], [0, 0, 0, 17, 0, 0, 0, 40, 0, 0, 9, 63, 13], [0, 0, 0, 0, 23, 0, 0, 0, 46, 0, 0, 13, 69]] New Matrix [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0], [0, 0, 0, 0, 0, 0, 13, 0, 0, 0, 13, 0, 0], [0, 0, 0, 0, 0, 0, 0, 17, 0, 0, 0, 17, 0], [0, 0, 0, 0, 0, 0, 0, 0, 23, 0, 0, 0, 23], [0, 8, 0, 0, 0, 0, 0, 0, 0, 31, 0, 0, 0], [0, 0, 13, 0, 0, 0, 0, 0, 0, 0, 36, 0, 0], [0, 0, 0, 17, 0, 0, 0, 0, 0, 0, 0, 40, 0], [0, 0, 0, 0, 23, 0, 0, 0, 0, 0, 0, 0, 46], [0, 8, 0, 0, 0, 31, 0, 0, 0, 0, 0, 0, 0], [0, 0, 13, 0, 0, 0, 36, 0, 0, 0, 0, 0, 0], [0, 0, 0, 17, 0, 0, 0, 40, 0, 0, 0, 0, 0], [0, 0, 0, 0, 23, 0, 0, 0, 46, 0, 0, 0, 0]]

Waterman-Eggert Implementation

How it works?

Output:

More articles: