How to count the frequencies of a certain character in a string?

Suppose I have a character string, like "AABBABBBAAAABBAAAABBBAABBBBABABB".

Is there a way to get R to count the runs of A and indicate how many of each length?

So, I would like to know how many instances there are of 3 A's in a row, how many of a single A, how many of 2 A's in a row, etc.

4 answers

Try

  v1 <- scan(text=gsub('[^A]+', ',', str1), sep=',', what='', quiet=TRUE)
  table(v1[nzchar(v1)])
  #    A   AA AAAA
  #    3    2    2

or

  library(stringi)
  table(stri_extract_all_regex(str1, '[A]+')[[1]])
  #    A   AA AAAA
  #    3    2    2

Benchmarks

  set.seed(42)
  x1 <- stri_rand_strings(1, 1e7, pattern='[AG]')

  system.time(table(stri_split_regex(x1, "[^A]+", omit_empty = TRUE)))
  #   user  system elapsed
  #  0.829   0.002   0.831

  system.time(table(stri_extract_all_regex(x1, '[A]+')[[1]]))
  #   user  system elapsed
  #  0.790   0.002   0.791

  system.time(table(rle(strsplit(x1,"")[[1]])))
  #   user  system elapsed
  # 30.230   1.243  31.523

  system.time(table(strsplit(x1, "[^A]+")))
  #   user  system elapsed
  #  4.253   0.006   4.258

  system.time(table(attr(gregexpr("A+",x1)[[1]], 'match.length')))
  #   user  system elapsed
  #  1.994   0.004   1.999

  library(microbenchmark)
  microbenchmark(
    david     = table(stri_split_regex(x1, "[^A]+", omit_empty = TRUE)),
    akrun     = table(stri_extract_all_regex(x1, '[A]+')[[1]]),
    david2    = table(strsplit(x1, "[^A]+")),
    glen      = table(rle(strsplit(x1,"")[[1]])),
    plannapus = table(attr(gregexpr("A+",x1)[[1]], 'match.length')),
    times=20L, unit='relative')
  #Unit: relative
  #      expr        min        lq      mean    median         uq       max neval cld
  #     david  1.0000000  1.000000  1.000000  1.000000  1.0000000  1.000000    20   a
  #     akrun  0.7908313  1.023388  1.054670  1.336510  0.9903384  1.004711    20   a
  #    david2  4.9325256  5.461389  5.613516  6.207990  5.6647301  5.374668    20   c
  #      glen 14.9064240 15.975846 16.672339 20.570874 15.8710402 15.465140    20   d
  # plannapus  2.5077719  3.123360  2.836338  3.557242  2.5689176  2.452964    20   b

data

  str1 <- 'AABBABBBAAAABBAAAABBBAABBBBABABB' 

A base R option is to run rle on the individual characters:

 table(rle(strsplit("AABBABBBAAAABBAAAABBBAABBBBABABB","")[[1]]))

gives

         values
  lengths A B
        1 3 1
        2 2 3
        3 0 2
        4 2 1

which (reading column A) means that there were 3 A runs of length 1, 2 A runs of length 2 and 2 A runs of length 4.
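If only the A column is of interest, the `rle` result can be filtered before tabulating. The following is a minimal sketch of that idea; the variable names `r` and `a_runs` are chosen here for illustration and are not part of the answer above.

```r
x <- "AABBABBBAAAABBAAAABBBAABBBBABABB"

# Run-length encode the individual characters
r <- rle(strsplit(x, "")[[1]])

# Keep only the lengths of the runs whose value is "A"
a_runs <- r$lengths[r$values == "A"]

# Tabulate: names are run lengths, values are how often each occurs
table(a_runs)
# a_runs
# 1 2 4
# 3 2 2
```

This gives the same counts as column A of the two-way table, without the B column.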


Here's an additional way, using strsplit:

 x <- "AABBABBBAAAABBAAAABBBAABBBBABABB"
 table(strsplit(x, "[^A]+"))
 #    A   AA AAAA
 #    3    2    2

Or similarly, with the stringi package:

 library(stringi)
 table(stri_split_regex(x, "[^A]+", omit_empty = TRUE))
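One caveat with the base `strsplit` version: when the string starts with a non-A character, `strsplit` returns a leading empty string, which `table` would then count as its own entry. The example string here starts with A, so it is not affected. A small sketch of the issue and one possible guard follows; the test string `y` is made up for illustration.

```r
# Hypothetical input that starts with a non-A character
y <- "BBAABA"

parts <- strsplit(y, "[^A]+")[[1]]
parts
# [1] ""   "AA" "A"

# Drop empty pieces before tabulating
runs <- parts[nzchar(parts)]
table(runs)
```

The stringi variant sidesteps this through `omit_empty = TRUE`.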

For completeness, here is another way, using regmatches and gregexpr to extract the regular expression matches:

 x <- "AABBABBBAAAABBAAAABBBAABBBBABABB"
 table(regmatches(x,gregexpr("A+",x))[[1]])
 #    A   AA AAAA
 #    3    2    2

Or in fact, since gregexpr stores the length of each matched substring as an attribute, you can even tabulate that directly:

 table(attr(gregexpr("A+",x)[[1]],'match.length'))
 # 1 2 4
 # 3 2 2
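The `match.length` approach can be wrapped in a small helper that works for any single letter. This is only a sketch: `count_runs` is a hypothetical name, and it assumes `ch` is a plain letter with no special meaning in a regular expression. It also filters out the `-1` that `gregexpr` reports when there is no match at all.

```r
# Hypothetical helper: tabulate run lengths of character `ch` in string `s`.
# Assumes `ch` is a plain letter (not a regex metacharacter).
count_runs <- function(s, ch = "A") {
  m <- gregexpr(paste0(ch, "+"), s)[[1]]
  len <- attr(m, "match.length")
  # gregexpr reports a match.length of -1 when nothing matched
  table(len[len > 0])
}

x <- "AABBABBBAAAABBAAAABBBAABBBBABABB"
count_runs(x)         # runs of A
count_runs(x, "B")    # runs of B
count_runs("BBBB")    # no A present: empty table
```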