Complement the DNA sequence

Question

Complement the DNA sequence

Suppose I have a DNA sequence. I want to get his complement. I used the following code, but I do not understand it. What am I doing wrong?

s=readline() ATCTCGGCGCGCATCGCGTACGCTACTAGC p=unlist(strsplit(s,"")) h=rep("N",nchar(s)) unlist(lapply(p,function(d){ for b in (1:nchar(s)) { if (p[b]=="A") h[b]="T" if (p[b]=="T") h[b]="A" if (p[b]=="G") h[b]="C" if (p[b]=="C") h[b]="G" }

+8

replace r bioinformatics genetics complement

Anurag mishra Dec 04 '13 at 9:43

source share

7 answers

Spacedman · Answer 1 · 2013-12-04T10:54:57+0000

Use chartr , which is created for this purpose:

 > s [1] "ATCTCGGCGCGCATCGCGTACGCTACTAGC" > chartr("ATGC","TACG",s) [1] "TAGAGCCGCGCGTAGCGCATGCGATGATCG"

Just specify two string character strings and a string. Also vectorized by argument for translation:

 > chartr("ATGC","TACG",c("AAAACG","TTTTT")) [1] "TTTTGC" "AAAAA"

Note. I am replacing a string representation of DNA, not a vector. To convert a vector, I would create a search map as a named vector and an index that:

 > p [1] "A" "T" "C" "T" "C" "G" "G" "C" "G" "C" "G" "C" "A" "T" "C" "G" "C" "G" "T" [20] "A" "C" "G" "C" "T" "A" "C" "T" "A" "G" "C" > map=c("A"="T", "T"="A","G"="C","C"="G") > unname(map[p]) [1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A" [20] "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"

Martin morgan · Answer 2 · 2013-12-04T17:42:28+0000

The Bioconductor Biostrings package has many useful features for this kind of operation. Install once:

 source("http://bioconductor.org/biocLite.R") biocLite("Biostrings")

then use

 library(Biostrings) dna = DNAStringSet(c("ATCTCGGCGCGCATCGCGTACGCTACTAGC", "ACCGCTA")) complement(dna)

42- · Answer 3 · 2013-12-04T09:58:11+0000

 sapply(p, switch, "A"="T", "T"="A","G"="C","C"="G") ATCTCGGCGCGCATCGCGT "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A" ACGCTACTAGC "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"

If you do not need additional names, you can always separate them using unname .

 unname(sapply(p, switch, "A"="T", "T"="A","G"="C","C"="G") ) [1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" [19] "A" "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G" >

JeremyS · Answer 4 · 2014-01-14T04:10:10+0000

There is also a seqinr package

 library(seqinr) comp(seq) # gives complement rev(comp(seq)) # gives the reverse complement

Biostrings has a much smaller memory profile, but seqinr is also good because you can select the base case (including mixed) and change them to whatever you want, for example, if you want the combination of T and U in the same sequence. Biostrings makes you have either T or U.

Megatron · Answer 5 · 2016-06-14T14:31:02+0000

For additions, both in upper and lower case, you can use chartr() :

 n <- "ACCTGccatGCATC" chartr("acgtACGT", "tgcaTGCA", n) # [1] "TGGACggtaCGTAG"

To take a step further and reverse complement the nucleotide sequence, you can use the following function:

 library(stringi) rc <- function(nucSeq) return(stri_reverse(chartr("acgtACGT", "tgcaTGCA", nucSeq))) rc("AcACGTgtT") # [1] "AacACGTgT"

pedrosaurio · Answer 6 · 2018-11-06T14:36:07+0000

Here is the answer using the p base. It is written with terrible formatting to clarify the situation and save it as one line. Supports upper and lower case.

 revc = function(s){ paste0( rev( unlist( strsplit( chartr("ATGCatgc","TACGtacg",s) , "") # from strsplit ) # from unlist ) # from rev , collapse='') # from paste0 }

Tom kelly · Answer 7 · 2019-04-04T08:25:04+0000

I generalized the rev(comp(seq)) seqinr with the seqinr package:

 install.packages("devtools") devtools::install_github("TomKellyGenetics/tktools") tktools::revcomp(seq)

This version is compatible with string inputs and is vectorized for list processing or vector input of multiple lines. The output class must match the input, including registers and types. It also supports input containing "U" for the output sequences of RNA and RNA.

 > seq <- "ATCTCGGCGCGCATCGCGTACGCTACTAGC" > revcomp(seq) [1] "GCTAGTAGCGTACGCGATGCGCGCCGAGAT" > seq <- c("TATAAT", "TTTCGC", "atgcat") > revcomp(seq) TATAAT TTTCGC atgcat "ATTATA" "GCGAAA" "atgcat"

See the github TomKellyGenetics / tktools package guide or repository .

Complement the DNA sequence

More articles: