Complement the DNA sequence

Suppose I have a DNA sequence. I want to get his complement. I used the following code, but I do not understand it. What am I doing wrong?

s=readline() ATCTCGGCGCGCATCGCGTACGCTACTAGC p=unlist(strsplit(s,"")) h=rep("N",nchar(s)) unlist(lapply(p,function(d){ for b in (1:nchar(s)) { if (p[b]=="A") h[b]="T" if (p[b]=="T") h[b]="A" if (p[b]=="G") h[b]="C" if (p[b]=="C") h[b]="G" } 
+8
replace r bioinformatics genetics complement
source share
7 answers

Use chartr , which is created for this purpose:

 > s [1] "ATCTCGGCGCGCATCGCGTACGCTACTAGC" > chartr("ATGC","TACG",s) [1] "TAGAGCCGCGCGTAGCGCATGCGATGATCG" 

Just specify two string character strings and a string. Also vectorized by argument for translation:

 > chartr("ATGC","TACG",c("AAAACG","TTTTT")) [1] "TTTTGC" "AAAAA" 

Note. I am replacing a string representation of DNA, not a vector. To convert a vector, I would create a search map as a named vector and an index that:

 > p [1] "A" "T" "C" "T" "C" "G" "G" "C" "G" "C" "G" "C" "A" "T" "C" "G" "C" "G" "T" [20] "A" "C" "G" "C" "T" "A" "C" "T" "A" "G" "C" > map=c("A"="T", "T"="A","G"="C","C"="G") > unname(map[p]) [1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A" [20] "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G" 
+13
source share

The Bioconductor Biostrings package has many useful features for this kind of operation. Install once:

 source("http://bioconductor.org/biocLite.R") biocLite("Biostrings") 

then use

 library(Biostrings) dna = DNAStringSet(c("ATCTCGGCGCGCATCGCGTACGCTACTAGC", "ACCGCTA")) complement(dna) 
+11
source share
 sapply(p, switch, "A"="T", "T"="A","G"="C","C"="G") ATCTCGGCGCGCATCGCGT "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A" ACGCTACTAGC "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G" 

If you do not need additional names, you can always separate them using unname .

 unname(sapply(p, switch, "A"="T", "T"="A","G"="C","C"="G") ) [1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" [19] "A" "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G" > 
+5
source share

There is also a seqinr package

 library(seqinr) comp(seq) # gives complement rev(comp(seq)) # gives the reverse complement 

Biostrings has a much smaller memory profile, but seqinr is also good because you can select the base case (including mixed) and change them to whatever you want, for example, if you want the combination of T and U in the same sequence. Biostrings makes you have either T or U.

+5
source share

For additions, both in upper and lower case, you can use chartr() :

 n <- "ACCTGccatGCATC" chartr("acgtACGT", "tgcaTGCA", n) # [1] "TGGACggtaCGTAG" 

To take a step further and reverse complement the nucleotide sequence, you can use the following function:

 library(stringi) rc <- function(nucSeq) return(stri_reverse(chartr("acgtACGT", "tgcaTGCA", nucSeq))) rc("AcACGTgtT") # [1] "AacACGTgT" 
+4
source share

Here is the answer using the p base. It is written with terrible formatting to clarify the situation and save it as one line. Supports upper and lower case.

 revc = function(s){ paste0( rev( unlist( strsplit( chartr("ATGCatgc","TACGtacg",s) , "") # from strsplit ) # from unlist ) # from rev , collapse='') # from paste0 } 
0
source share

I generalized the rev(comp(seq)) seqinr with the seqinr package:

 install.packages("devtools") devtools::install_github("TomKellyGenetics/tktools") tktools::revcomp(seq) 

This version is compatible with string inputs and is vectorized for list processing or vector input of multiple lines. The output class must match the input, including registers and types. It also supports input containing "U" for the output sequences of RNA and RNA.

 > seq <- "ATCTCGGCGCGCATCGCGTACGCTACTAGC" > revcomp(seq) [1] "GCTAGTAGCGTACGCGATGCGCGCCGAGAT" > seq <- c("TATAAT", "TTTCGC", "atgcat") > revcomp(seq) TATAAT TTTCGC atgcat "ATTATA" "GCGAAA" "atgcat" 

See the github TomKellyGenetics / tktools package guide or repository .

0
source share

All Articles