Efficient R code for finding the indices associated with unique values in a vector

Suppose I have a vector vec <- c("D","B","B","C","C").

My goal is to end up with a list of length length(unique(vec)), where element i of the list is an index vector giving the positions of unique(vec)[i] in vec.

For example, for vec the list would be:

    exampleList <- list()
    exampleList[[1]] <- c(1)    # Since "D" is the first element
    exampleList[[2]] <- c(2,3)  # Since "B" is the 2nd/3rd element
    exampleList[[3]] <- c(4,5)  # Since "C" is the 4th/5th element

I tried the following approach, but it is too slow. My actual vector is large, so I need faster code:

    vec <- c("D","B","B","C","C")
    uniques <- unique(vec)
    # loop over every unique value rather than a hard-coded 1:3
    exampleList <- lapply(seq_along(uniques), function(i) {
      which(vec == uniques[i])
    })
    exampleList
4 answers

Update: DT[, list(list(.)), by=.] sometimes gave incorrect results in R >= 3.1.0. This has now been fixed in commit #1280 in the current version of data.table, v1.9.3. From NEWS:

  • DT[, list(list(.)), by=.] returns correct results in R >= 3.1.0. The bug was due to recent (welcome) changes in R v3.1.0, where list(.) does not result in a copy. Closes #481.

Using data.table is about 15 times faster than tapply:

    library(data.table)

    vec <- c("D","B","B","C","C")

    dt = as.data.table(vec)[, list(list(.I)), by = vec]
    dt
    #   vec  V1
    #1:   D   1
    #2:   B 2,3
    #3:   C 4,5

    # to get it in the desired format
    # (perhaps in the future data.table setnames will work for lists instead)
    setattr(dt$V1, 'names', dt$vec)
    dt$V1
    #$D
    #[1] 1
    #
    #$B
    #[1] 2 3
    #
    #$C
    #[1] 4 5

Speed tests:

    vec = sample(letters, 1e7, TRUE)

    system.time(tapply(seq_along(vec), vec, identity)[unique(vec)])
    #   user  system elapsed
    #   7.92    0.35    8.50

    system.time({dt = as.data.table(vec)[, list(list(.I)), by = vec]; setattr(dt$V1, 'names', dt$vec); dt$V1})
    #   user  system elapsed
    #   0.39    0.09    0.49

You can do this with tapply:

    vec <- c("D", "B", "B", "C", "C")
    tapply(seq_along(vec), vec, identity)[unique(vec)]
    # $D
    # [1] 1
    #
    # $B
    # [1] 2 3
    #
    # $C
    # [1] 4 5

The identity function simply returns its argument, and indexing with unique(vec) ensures that the groups come back in the order in which the values first appear in the original vector.
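To illustrate why the indexing step matters, here is a minimal sketch comparing the result with and without it. The variable names sorted_order and first_seen are made up for this illustration and are not part of the answer above.

```r
vec <- c("D", "B", "B", "C", "C")

# Without the [unique(vec)] step, tapply groups by factor level,
# so the groups come back in sorted order of the values.
sorted_order <- tapply(seq_along(vec), vec, identity)
names(sorted_order)

# Indexing by unique(vec) reorders the groups by first appearance in vec.
first_seen <- sorted_order[unique(vec)]
names(first_seen)
```

If the order of the groups does not matter for your use case, the indexing step can be dropped.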

 split(seq_along(vec), vec) 

This is faster and shorter than the tapply solution:

    vec = sample(letters, 1e7, TRUE)

    system.time(res1 <- tapply(seq_along(vec), vec, identity)[unique(vec)])
    #   user  system elapsed
    #  1.808   0.364   2.176

    system.time(res2 <- split(seq_along(vec), vec))
    #   user  system elapsed
    #  0.876   0.152   1.029
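One caveat worth checking: split() groups by factor level, so on its own it returns the groups in sorted order (B, C, D for the example vector) rather than in order of first appearance. The sketch below shows two ways to get the order asked for in the question; the variable names are illustrative only, and neither variant was timed in the benchmark above.

```r
vec <- c("D", "B", "B", "C", "C")

# Plain split(): groups are ordered by sorted factor level (B, C, D).
by_level <- split(seq_along(vec), vec)

# Option 1: reorder the finished list by first appearance.
by_first_seen <- by_level[unique(vec)]

# Option 2: fix the level order up front so split() returns D, B, C directly.
by_first_seen2 <- split(seq_along(vec), factor(vec, levels = unique(vec)))

names(by_level)
names(by_first_seen)
```

Both options should give the same list; which one is faster on a large vector would need to be measured.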

To preserve the original order with josilber's answer, simply index the result with the uniques vector you created:

    vec <- c("D","B","B","C","C")
    uniques <- unique(vec)
    tapply(seq_along(vec), vec, identity)[uniques]
    # $D
    # [1] 1
    #
    # $B
    # [1] 2 3
    #
    # $C
    # [1] 4 5