Get indices of repeated instances of vector elements in another vector (both very large)

I have two vectors: one (A) of about 100 million non-unique elements (integers), the other (B) of 1 million unique elements of the same kind. I am trying to get a list containing, for each element of B, the indices at which it occurs in A.

A <- c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2)
B <- 1:3

# would result in this:
[[1]]
[1] 2 3 4 6 7

[[2]]
[1]  1  5 10

[[3]]
[1] 8 9

At first I naively tried:

b_indices <- lapply(B, function(b) which(A == b))

which is terribly inefficient and apparently would not finish for several years.

The second thing I tried was to create a list of empty vectors indexed by all the elements of B, and then loop through A, appending each index to the vector that corresponds to that element of A. Although this is technically O(n), I'm not sure how much time the repeated appending of elements costs. This approach would apparently take 2-3 days, which is still too slow...
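For illustration, the second approach described above might look roughly like the following on the toy data. This is only a sketch: the names `b_indices` and `key` are chosen here, and A is converted to integer up front so that its values match the names derived from B.

```r
# Toy data from the question, stored as integers
A <- as.integer(c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2))
B <- 1:3

# One empty slot per element of B, addressed by name
b_indices <- vector("list", length(B))
names(b_indices) <- B

# Single pass over A; each c() call builds a new, longer vector
for (i in seq_along(A)) {
  key <- as.character(A[i])
  b_indices[[key]] <- c(b_indices[[key]], i)
}
```

Because `c()` allocates a new vector on every append, the total work can grow faster than the single pass over A suggests.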

Is there a faster way?


With data.table you can do this in R without iterating over the 1-million-element vector at all:

require(data.table)
a <- data.table(x=rep(c("a","b","c"),each=3))
a[ , list( yidx = list(.I) ) , by = x ]

   x  yidx
1: a 1,2,3
2: b 4,5,6
3: c 7,8,9

With the example data from the question:

a <- data.table(x=c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2))
a[ , list( yidx = list(.I) ) , by = x ]

   x      yidx
1: 2   1, 5,10
2: 1 2,3,4,6,7
3: 3       8,9
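The grouped result follows the order in which values first appear in A, not the order of B. If a plain list ordered like B is wanted, one possible way to get it is sketched below; the names `res` and `b_indices` are chosen here for illustration.

```r
library(data.table)

A <- c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2)
B <- 1:3

# Same grouping as above: one row per distinct value, indices in a list column
res <- data.table(x = A)[ , list( yidx = list(.I) ) , by = x ]

# Reorder the list column to follow B; values of B absent from A give NULL
b_indices <- res$yidx[match(B, res$x)]
```

Here `match(B, res$x)` looks up the row of each element of B, so the result lines up with B position by position.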


46%, order Debian 5%, order Windows 8 2.x .

B <- seq_len(1e6)
set.seed(42)
A <- data.table(x = sample(B, 1e8, TRUE))
system.time({
  res <- A[ , list( yidx = list(.I) ) , by = x ]
})
   user  system elapsed 
   4.25    0.22    4.50 

Another option is to order A with the radix method and then split:

A1 <- order(A, method = "radix")

split(A1, A[A1])
#$`1`
#[1] 2 3 4 6 7
#
#$`2`
#[1]  1  5 10
#
#$`3`
#[1] 8 9

B <- seq_len(1e6)
set.seed(42)
A <- sample(B, 1e8, TRUE)

system.time({
  A1 <- order(A, method = "radix")

  res <- split(A1, A[A1])
})
# user      system     elapsed 
#8.650       1.056       9.704
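If some elements of B might not occur in A, a possible variant is to split on a factor whose levels are B, so that every element of B gets an entry. This is only a sketch on toy data and has not been timed against the radix version; B is extended to 1:4 here purely to show the empty case.

```r
A <- c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2)
B <- 1:4  # 4 does not occur in A

# Levels come from B, so the result has one element per entry of B
res <- split(seq_along(A), factor(A, levels = B))
```

The entry for 4 is then an empty integer vector rather than being missing from the list.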

We can also use dplyr

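library(dplyr)
# tibble() replaces the deprecated data_frame()
tibble(A) %>%
  mutate(B = row_number()) %>%   # B holds the position of each element of A
  group_by(A) %>%
  summarise(B = list(B)) %>%     # one list entry of positions per value of A
  .$B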

#[[1]]
#[1] 2 3 4 6 7

#[[2]]
#[1]  1  5 10

#[[3]]
#[1] 8 9

On a smaller dataset of size 1e5, system.time gives:

#   user  system elapsed 
#   0.01    0.00    0.02 

but with the large example shown in the other answer, it is slower. However, this dplyr...
