How to get a pair "sequence similarity score" for ~ 1000 proteins?

I have a large number of protein sequences in fasta format.

I want to get a pairwise sequence similarity score for each pair of proteins.

Can any packet in R be used to obtain fusion similarity scores for protein sequences?

+7
source share
2 answers

According to the proposal, Chase bioconductor really fits, and in particular the Biostrings package. To install the latter, I would suggest installing the bioconductor base library as such:

 source("http://bioconductor.org/biocLite.R") biocLite() 

This way you cover all the dependencies. Now, to align 2 protein sequences or any two sequences in this case, you will need to use pairwiseAlignment from Biostrings . Given the fasta protseq.fasta file of two sequences, which looks like this:

 >protein1 MYRALRLLARSRPLVRAPAAALASAPGLGGAAVPSFWPPNAAR MASQNSFRIEYDTFGELKVPNDKYYGAQTVRSTMNFKIGGVTE RMPTPVIKAFGILKRAAAEVNQDYGLDPKIANAIMKAADEVAE GKLNDHFPLVVWQTGSGTQTNMNVNEVISNRAIEMLGGELGSK IPVHPNDHVNKSQ >protein2 MRSRPAGPALLLLLLFLGAAESVRRAQPPRRYTPDWPSLDSRP LPAWFDEAKFGVFIHWGVFSVPAWGSEWFWWHWQGEGRPYQRF MRDNYPPGFSYADFGPQFTARFFHPEEWADLFQAAGAKYVVLT TKHHEGFTNW* 

If you want to globally align these 2 sequences using, for example, BLOSUM100 as your replacement matrix, 0 is a fine for opening the gap and -5 for expanding one of them:

 require("Biostrings") data(BLOSUM100) seqs <- readFASTA("~/Desktop/protseq.fasta", strip.descs=TRUE) alm <- pairwiseAlignment(seqs[[1]]$seq, seqs[[2]]$seq, substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5) 

The result of this is (some alignment removed to save space):

 > alm Global PairwiseAlignedFixedSubject (1 of 1) pattern: [1] MYRALRLLARSRPLVRA-PAAALAS.... subject: [1] MR-------SRP---AGPALLLLL.... score: -91 

To extract an invoice for each alignment:

 > score(alm) [1] -91 

With this in mind, you can now easily perform all pairwise alignments using a very simple looping logic. To better change pair alignment with bioconductor , I suggest you read this .

An alternative approach would be to perform multiple sequence alignment rather than pairwise. You can use bio3d , and from there seqaln , to align all the sequences in your fasta file.

+11
source

6 years later, but:

The protr package protr , which has a parallel affinity counting function, parGOsim() . It can accept lists of protein sequences, so the loop is not needed for writing.

0
source

All Articles