How can I decide how to get back to the Smith-Waterman algorithm?

Question

How can I decide how to get back to the Smith-Waterman algorithm?

I am trying to implement local sequence alignment in Python using the Smith-Waterman algorithm.

Here is what I still have. This comes to building a similarity matrix :

import sys, string from numpy import * f1=open(sys.argv[1], 'r') seq1=f1.readline() f1.close() seq1=string.strip(seq1) f2=open(sys.argv[2], 'r') seq2=f2.readline() f2.close() seq2=string.strip(seq2) a,b =len(seq1),len(seq2) penalty=-1; point=2; #generation of matrix for local alignment p=zeros((a+1,b+1)) # table calculation and matrix generation for i in range(1,a+1): for j in range(1,b+1): vertical_score =p[i-1][j]+penalty; horizontal_score= p[i][j-1]+penalty; if seq1[i-1]==seq2[j-1]: diagonal_score =p[i-1][j-1]+point; else: diagonal_score = p[i-1][j-1]+penalty; p[i][j]=max(0,vertical_score,horizontal_score,diagonal_score); print p

For example, with two sequences:

 agcacact acacacta

my code outputs a similarity matrix:

 [[ 0. 0. 0. 0. 0. 0. 0. 0. 0.] [ 0. 2. 1. 2. 1. 2. 1. 0. 2.] [ 0. 1. 1. 1. 1. 1. 1. 0. 1.] [ 0. 0. 3. 2. 3. 2. 3. 2. 1.] [ 0. 2. 2. 5. 4. 5. 4. 3. 4.] [ 0. 1. 4. 4. 7. 6. 7. 6. 5.] [ 0. 2. 3. 6. 6. 9. 8. 7. 8.] [ 0. 1. 4. 5. 8. 8. 11. 10. 9.] [ 0. 0. 3. 4. 7. 7. 10. 13. 12.]]

Now I'm stuck in the next step of the algorithm, backtracking, to build the best alignment.

Wikipedia says

To get the optimal local alignment, we start with the highest value in the matrix (i, j). Then we return to one of the positions (i - 1, j), (i, j - 1) and (i - 1, j - 1) depending on the direction of motion used to build the matrix. We keep this process until we reach the matrix cell with a zero value or a value in position (0, 0).

I’m having trouble deciding which position to return to. What does Wikipedia mean under “depending on the direction of motion used to build the matrix”, and how can I implement this in Python?

Finally i did it

 import sys, string from numpy import* import re # read first sequence fasta_sequence1=open(sys.argv[1], 'r') seq1="" for i in fasta_sequence1: if i.startswith(">"): pass else: seq1 = seq1 + i.strip() fasta_sequence1.close() fasta_sequence2=open(sys.argv[2], 'r') seq2 = "" for i in fasta_sequence2: if i.startswith('>'): pass else: seq2 = seq2+ i.strip() fasta_sequence2.close() a,b =len(seq1),len(seq2) penalty=-1; point=2; #generation of matrix for local alignment p=zeros((a+1,b+1)) #intialization of max score max_score=0; #pointer to store the traceback path pointer=zeros((a+1,b+1)) # table calculation and matrix generation for i in range(1,a+1): for j in range(1,b+1): vertical_score =p[i-1][j]+penalty; horizontal_score= p[i][j-1]+penalty; if seq1[i-1]==seq2[j-1]: diagonal_score =p[i-1][j-1]+point; else: diagonal_score = p[i-1][j-1]+penalty; for i in range(1,a+1): for j in range(1,b+1): p[i][j]=max(0,vertical_score,horizontal_score,diagonal_score); for i in range(1,a+1): for j in range(1,b+1): if p[i][j]==0: pointer[i][j]=0; #0 means end of the path if p[i][j]==vertical_score: pointer[i][j]=1; #1 means trace up if p[i][j]==horizontal_score: pointer[i][j]=2; #2 means trace left if p[i][j]==diagonal_score: pointer[i][j]=3; #3 means trace diagonal if p[i][j]>=max_score: maximum_i=i; maximum_j=j; max_score=p[i][j]; #for i in range(1,a+1): # for j in range(1,b+1): #if p[i][j]>max_score: #max_score=p[i][j] print max_score # traceback of all the pointers align1,align2='',''; #initial sequences i,j=max_i,max_j; #indices of path starting point while pointer[i][j]!=0: if pointer[i][j]==3: align1=align1+seq1[i-1]; align2=align2+seq2[j-1]; i=i-1; j=j-1; elif pointer[i][j]==2: align1=align1+'-'; align2=align2+seq2[j-1] j=j-1; elif pointer[i][j]==1: align1=align1+seq1[i-1]; align2=align2+'-'; i=i-1; align1=align1[::-1]; #reverse sequence 1 align2=align2[::-1]; #reverse sequence 2 #output_file = open(sys.argv[3],"w") #output_file.write(align1) #output_file.write(align2) print align1 print align2

But I think there might be a better and more efficient way to do this

 output_file = open(sys.argv[3],"w") output_file.write(align1) output_file.write(align2)

the result shows that

 agcacacta-cacact

opposite c for:

 print align1 print align2

shows the correct result:

 agcacact a-cacact

How can I get the above output in a file entry. Thanks

+3

python numpy bioinformatics

Nodnin 01 Oct '12 at 1:10

source share

1 answer

Gareth rees · Accepted Answer · 2012-10-01T10:47:06+0000

When you create a similarity matrix, you need to save not only the similarity index, but also that result. You currently have a line of code:

 p[i][j]=max(0,vertical_score,horizontal_score,diagonal_score);

therefore, here you need to remember not only the maximum score, but which one was the maximum. Then, when you begin the retreat, you will know which direction to go.

For example, you can try something like this:

 import numpy DELETION, INSERTION, MATCH = range(3) def smith_waterman(seq1, seq2, insertion_penalty = -1, deletion_penalty = -1, mismatch_penalty = -1, match_score = 2): """ Find the optimum local sequence alignment for the sequences `seq1` and `seq2` using the Smith-Waterman algorithm. Optional keyword arguments give the gap-scoring scheme: `insertion_penalty` penalty for an insertion (default: -1) `deletion_penalty` penalty for a deletion (default: -1) `mismatch_penalty` penalty for a mismatch (default: -1) `match_score` score for a match (default: 2) See <http://en.wikipedia.org/wiki/Smith-Waterman_algorithm>. >>> for s in smith_waterman('AGCAGACT', 'ACACACTA'): print s ... AGCAGACT- A-CACACTA """ m, n = len(seq1), len(seq2) # Construct the similarity matrix in p[i][j], and remember how # we constructed it -- insertion, deletion or (mis)match -- in # q[i][j]. p = numpy.zeros((m + 1, n + 1)) q = numpy.zeros((m + 1, n + 1)) for i in range(1, m + 1): for j in range(1, n + 1): deletion = (p[i - 1][j] + deletion_penalty, DELETION) insertion = (p[i][j - 1] + insertion_penalty, INSERTION) if seq1[i - 1] == seq2[j - 1]: match = (p[i - 1][j - 1] + match_score, MATCH) else: match = (p[i - 1][j - 1] + mismatch_penalty, MATCH) p[i][j], q[i][j] = max((0, 0), deletion, insertion, match) # Yield the aligned sequences one character at a time in reverse # order. def backtrack(): i, j = m, n while i > 0 or j > 0: assert i >= 0 and j >= 0 if q[i][j] == MATCH: i -= 1 j -= 1 yield seq1[i], seq2[j] elif q[i][j] == INSERTION: j -= 1 yield '-', seq2[j] elif q[i][j] == DELETION: i -= 1 yield seq1[i], '-' else: assert(False) return [''.join(reversed(s)) for s in zip(*backtrack())]

How can I decide how to get back to the Smith-Waterman algorithm?

More articles: