Count repeated occurrences of a pattern in a character string in R

For example, I have a string

"AAAAAAACGAAAAAACGAAADGCGEDCG" 

I want to count how many times "CG" occurs in it. How can I do that?

+7
6 answers

You can use gregexpr to find the positions of "CG" in vec. gregexpr returns -1 when there is no match, so we compare the result against -1; sum then counts the actual matches.

 > vec <- "AAAAAAACGAAAAAACGAAADGCGEDCG"
 > sum(gregexpr("CG", vec)[[1]] != -1)
 [1] 4
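To see why the comparison with -1 matters, here is a minimal sketch. The input "AAAA" is just an illustrative string that contains no "CG"; the variable names are made up for this example.

```r
# gregexpr returns a single -1 (not an empty vector) when nothing matches
no_match <- gregexpr("CG", "AAAA")[[1]]

# Counting elements would report one match that does not exist
naive_count <- length(no_match)

# Comparing against -1 first gives the correct count of zero
safe_count <- sum(no_match != -1)

print(c(naive = naive_count, safe = safe_count))
```

So length() alone overcounts strings without a match, which is why the answer uses sum(... != -1).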

If you have a vector of strings, you can use sapply:

 > vec <- c("ACACACACA", "GGAGGAGGAG", "AACAACAACAAC", "GGCCCGCCGC", "TTTTGTT", "AGAGAGA")
 > sapply(gregexpr("CG", vec), function(x) sum(x != -1))
 [1] 0 0 0 2 0 0
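As a possible alternative in base R (not part of the answer above, just a sketch), regmatches extracts the matched substrings and returns an empty vector where nothing matched, so lengths can count them without the -1 check. This assumes R 3.2.0 or later, where lengths is available.

```r
vec <- c("ACACACACA", "GGAGGAGGAG", "AACAACAACAAC",
         "GGCCCGCCGC", "TTTTGTT", "AGAGAGA")

# One list element per input string, holding every matched "CG"
matches <- regmatches(vec, gregexpr("CG", vec))

# Number of matches per string; strings without a match give 0
counts <- lengths(matches)
print(counts)
```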

If you have a list of strings you can use unlist(vec) and then use the solution above.
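A minimal sketch of the list case, assuming every list element is a single string. The list contents and variable names are invented for illustration.

```r
# Hypothetical input: a list of single strings
strings <- list("AACGCG", "TTT", "CGA")

# Flatten the list into a character vector
vec <- unlist(strings)

# Same counting approach as above; fixed = TRUE treats "CG" as a literal string
counts <- sapply(gregexpr("CG", vec, fixed = TRUE), function(m) sum(m != -1))
print(counts)
```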

+7

The Bioconductor package Biostrings has a matchPattern function

 countGC <- matchPattern("GC",DNSstring_object) 

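Note that DNSstring_object is a FASTA sequence read with the Biostrings function readDNAStringSet or readAAStringSet. matchPattern returns the matches themselves; to get a count, take their length or use countPattern. The example searches for "GC"; for the question above the pattern would be "CG".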

+4

Use str_count from stringr. It is easy to remember and read, although it is not a base R function.

 library(stringr)
 str_count("AAAAAAACGAAAAAACGAAADGCGEDCG", "CG")
 # [1] 4
+4

In base R, you can use substring in a loop (here via sapply) to look for "CG" occurrences

 > str <- "AAAAAAACGAAAAAACGAAADGCGEDCG"
 > x <- sapply(seq(nchar(str)-1), function(i) substring(str, i, i+1) == 'CG')
 > sum(x)
 ## [1] 4
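Since substring is vectorized over its start and end positions, the same sliding-window idea can be sketched without sapply. Note that a sliding window counts overlapping occurrences, while gregexpr does not; for "CG" the two agree, but for a pattern such as "AA" they differ. The second string below is only an illustration.

```r
s <- "AAAAAAACGAAAAAACGAAADGCGEDCG"
n <- nchar(s)

# All two-character windows of s, extracted in one vectorized call
windows <- substring(s, 1:(n - 1), 2:n)
cg_count <- sum(windows == "CG")

# Illustrative string where overlapping and non-overlapping counts differ
s2 <- "AAAA"
overlapping <- sum(substring(s2, 1:3, 2:4) == "AA")
non_overlapping <- sum(gregexpr("AA", s2, fixed = TRUE)[[1]] != -1)

print(c(cg = cg_count, overlapping = overlapping, non_overlapping = non_overlapping))
```

Which count is wanted depends on the problem; for "CG" an occurrence cannot overlap with itself, so either approach works.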
+2

It might be interesting to benchmark the different string functions

 ## Data
 require("stringi")
 vec = paste0(sample(LETTERS, 1e6, replace = TRUE), collapse = "")
 df <- data.frame(vec, vec, vec, vec, vec, vec, vec, vec, vec, vec, stringsAsFactors = FALSE)

 ### Base method
 base_fun <- function(x){
   sapply(gregexpr("CG", x), function(x) sum(x != -1))
 }

 ### Stringi Method
 stringi_fun <- function(x){
   sapply(x, function(x) stri_count_fixed(x,"CG"))
 }

 ### Stringr method
 library(stringr)
 stringr_fun <- function(x){
   sapply(x, function(x) str_count(x, "CG"))
 }

 base_fun(df)
 # [1] 1441 1441 1441 1441 1441 1441 1441 1441 1441 1441

 stringi_fun(df)
 #   vec vec.1 vec.2 vec.3 vec.4 vec.5 vec.6 vec.7 vec.8 vec.9
 #  1441  1441  1441  1441  1441  1441  1441  1441  1441  1441

 stringr_fun(df)
 #   vec vec.1 vec.2 vec.3 vec.4 vec.5 vec.6 vec.7 vec.8 vec.9
 #  1441  1441  1441  1441  1441  1441  1441  1441  1441  1441

 require(rbenchmark)
 benchmark(base_fun(df), stringi_fun(df), stringr_fun(df))
 #              test replications elapsed relative user.self sys.self user.child sys.child
 # 1    base_fun(df)          100  17.499    1.000    17.513        0          0         0
 # 2 stringi_fun(df)          100  34.897    1.994    34.926        0          0         0
 # 3 stringr_fun(df)          100  17.555    1.003    17.564        0          0         0

These are the results for this particular example; feel free to add to or change them. By elapsed time, base_fun(df) and stringr_fun(df) are about equal, and both are roughly twice as fast as stringi_fun(df).

EDIT: The search engine in stringi 0.2-3 has been greatly improved. New benchmarks (on another machine):

 benchmark(base_fun(df), stringi_fun(df), stringr_fun(df))
 ##              test replications elapsed relative user.self sys.self user.child sys.child
 ## 1    base_fun(df)          100  26.412   21.214    26.353    0.004          0         0
 ## 2 stringi_fun(df)          100   1.245    1.000     1.241    0.000          0         0
 ## 3 stringr_fun(df)          100  26.995   21.683    26.905    0.011          0         0

So now stringi_fun(df) is about 21 times faster than base_fun(df) and stringr_fun(df), which remain about equal.
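If rbenchmark is not installed, a rough timing can be sketched with base R's system.time. This is only an illustration, not a replacement for the benchmark above: the seed, string length, repetition count and function names are arbitrary choices made here, and the timings will differ by machine and R version. It compares gregexpr with and without fixed = TRUE and checks that both give the same count.

```r
set.seed(42)  # arbitrary seed so the random string is reproducible
vec <- paste0(sample(LETTERS, 1e5, replace = TRUE), collapse = "")

# Regex matching versus literal (fixed) matching, both in base R
count_regex <- function(x) sum(gregexpr("CG", x)[[1]] != -1)
count_fixed <- function(x) sum(gregexpr("CG", x, fixed = TRUE)[[1]] != -1)

# Both variants must agree on the count before timing them
stopifnot(count_regex(vec) == count_fixed(vec))

# Time 10 repetitions of each variant
time_regex <- system.time(for (i in 1:10) count_regex(vec))
time_fixed <- system.time(for (i in 1:10) count_fixed(vec))

print(c(regex = time_regex[["elapsed"]], fixed = time_fixed[["elapsed"]]))
```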

+1

Use stri_count_fixed from the stringi package

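 require("stringi")
 dna <- c("a", "g", "c", "t")
 N <- 160
 x <- sample(dna, N, replace = TRUE)  # draw N letters with replacement
 x2 <- stri_paste(x, collapse = "")   # join them into one string
 stri_count_fixed(x2, "gaga")
 ## 2  (the sequence is random, so the count varies between runs)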
0