I am trying to implement local sequence alignment in Python using the Smith-Waterman algorithm.
Here is what I still have. This comes to building a similarity matrix :
import sys, string from numpy import * f1=open(sys.argv[1], 'r') seq1=f1.readline() f1.close() seq1=string.strip(seq1) f2=open(sys.argv[2], 'r') seq2=f2.readline() f2.close() seq2=string.strip(seq2) a,b =len(seq1),len(seq2) penalty=-1; point=2; #generation of matrix for local alignment p=zeros((a+1,b+1)) # table calculation and matrix generation for i in range(1,a+1): for j in range(1,b+1): vertical_score =p[i-1][j]+penalty; horizontal_score= p[i][j-1]+penalty; if seq1[i-1]==seq2[j-1]: diagonal_score =p[i-1][j-1]+point; else: diagonal_score = p[i-1][j-1]+penalty; p[i][j]=max(0,vertical_score,horizontal_score,diagonal_score); print p
For example, with two sequences:
agcacact acacacta
my code outputs a similarity matrix:
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0.] [ 0. 2. 1. 2. 1. 2. 1. 0. 2.] [ 0. 1. 1. 1. 1. 1. 1. 0. 1.] [ 0. 0. 3. 2. 3. 2. 3. 2. 1.] [ 0. 2. 2. 5. 4. 5. 4. 3. 4.] [ 0. 1. 4. 4. 7. 6. 7. 6. 5.] [ 0. 2. 3. 6. 6. 9. 8. 7. 8.] [ 0. 1. 4. 5. 8. 8. 11. 10. 9.] [ 0. 0. 3. 4. 7. 7. 10. 13. 12.]]
Now I'm stuck in the next step of the algorithm, backtracking, to build the best alignment.
Wikipedia says
To get the optimal local alignment, we start with the highest value in the matrix (i, j). Then we return to one of the positions (i - 1, j), (i, j - 1) and (i - 1, j - 1) depending on the direction of motion used to build the matrix. We keep this process until we reach the matrix cell with a zero value or a value in position (0, 0).
I’m having trouble deciding which position to return to. What does Wikipedia mean under “depending on the direction of motion used to build the matrix”, and how can I implement this in Python?
Finally i did it
import sys, string from numpy import* import re # read first sequence fasta_sequence1=open(sys.argv[1], 'r') seq1="" for i in fasta_sequence1: if i.startswith(">"): pass else: seq1 = seq1 + i.strip() fasta_sequence1.close() fasta_sequence2=open(sys.argv[2], 'r') seq2 = "" for i in fasta_sequence2: if i.startswith('>'): pass else: seq2 = seq2+ i.strip() fasta_sequence2.close() a,b =len(seq1),len(seq2) penalty=-1; point=2; #generation of matrix for local alignment p=zeros((a+1,b+1)) #intialization of max score max_score=0; #pointer to store the traceback path pointer=zeros((a+1,b+1)) # table calculation and matrix generation for i in range(1,a+1): for j in range(1,b+1): vertical_score =p[i-1][j]+penalty; horizontal_score= p[i][j-1]+penalty; if seq1[i-1]==seq2[j-1]: diagonal_score =p[i-1][j-1]+point; else: diagonal_score = p[i-1][j-1]+penalty; for i in range(1,a+1): for j in range(1,b+1): p[i][j]=max(0,vertical_score,horizontal_score,diagonal_score); for i in range(1,a+1): for j in range(1,b+1): if p[i][j]==0: pointer[i][j]=0; #0 means end of the path if p[i][j]==vertical_score: pointer[i][j]=1; #1 means trace up if p[i][j]==horizontal_score: pointer[i][j]=2; #2 means trace left if p[i][j]==diagonal_score: pointer[i][j]=3; #3 means trace diagonal if p[i][j]>=max_score: maximum_i=i; maximum_j=j; max_score=p[i][j]; #for i in range(1,a+1): # for j in range(1,b+1): #if p[i][j]>max_score: #max_score=p[i][j] print max_score # traceback of all the pointers align1,align2='',''; #initial sequences i,j=max_i,max_j; #indices of path starting point while pointer[i][j]!=0: if pointer[i][j]==3: align1=align1+seq1[i-1]; align2=align2+seq2[j-1]; i=i-1; j=j-1; elif pointer[i][j]==2: align1=align1+'-'; align2=align2+seq2[j-1] j=j-1; elif pointer[i][j]==1: align1=align1+seq1[i-1]; align2=align2+'-'; i=i-1; align1=align1[::-1]; #reverse sequence 1 align2=align2[::-1]; #reverse sequence 2 #output_file = open(sys.argv[3],"w") #output_file.write(align1) #output_file.write(align2) print align1 print align2
But I think there might be a better and more efficient way to do this
output_file = open(sys.argv[3],"w") output_file.write(align1) output_file.write(align2)
the result shows that
agcacacta-cacact
opposite c for:
print align1 print align2
shows the correct result:
agcacact a-cacact
How can I get the above output in a file entry. Thanks