Creating indicators in large data frames

The goal is to create indicators for the factor / string variable in the data frame. This data frame has> 2 mm rows and R starts on windows, I have no way to use plyr with .parallel = T. Therefore, I take the split and win route using plyr and reshape2.

The start of the melt and fill ends from memory and with

ddply( idata.frame(items) , c("ID") , function(x){
       (    colSums( model.matrix( ~ x$element - 1) ) > 0   )
} , .progress="text" )    

or

ddply( idata.frame(items) , c("ID") , function(x){
           (    elements %in% x$element   )
    } , .progress="text" )  

takes some time. The fastest approach is the call for action below. Do you see a way to speed it up? The% in% statement is faster than calling model.matrix. Thank.

set.seed(123)

dd <- data.frame(
  id  = sample( 1:5, size=10 , replace=T ) ,
  prd = letters[sample( 1:5, size=10 , replace=T )]
  )

prds <- unique(dd$prd)

tapply( dd$prd , dd$id , function(x) prds %in% x )
+5
source share
3 answers

bigmemory bigtabulate . :

library(bigmemory)
library(bigtabulate)

set.seed(123)

dd <- data.frame(
  id = sample( 1:15, size=2e6 , replace=T ), 
  prd = letters[sample( 1:15, size=2e6 , replace=T )]
  )

prds <- unique(dd$prd)

benchmark(
bigtable(dd,c(1,2))>0,
table(dd[,1],dd[,2])>0,
xtabs(~id+prd,data=dd)>0,
tapply( dd$prd , dd$id , function(x) prds %in% x )
)

( ):

                                            test replications elapsed relative user.self sys.self user.child sys.child
1                      bigtable(dd, c(1, 2)) > 0          100  54.401 1.000000    51.759    3.817          0         0
2                    table(dd[, 1], dd[, 2]) > 0          100 112.361 2.065422   107.526    6.614          0         0
4 tapply(dd$prd, dd$id, function(x) prds %in% x)          100 178.308 3.277660   166.544   13.275          0         0
3                xtabs(~id + prd, data = dd) > 0          100 229.435 4.217478   217.014   16.660          0         0

, bigtable . , , . ?bigtable .

+4

, , .. ( , , , / 1...)? , xtabs , ...

library(rbenchmark)
benchmark(
          tapply( dd$prd , dd$id , function(x) prds %in% x ),
          xtabs(~id+prd,data=dd)>0)

     test        replications elapsed relative 
1 tapply(...)             100   0.053 1.000000
2 xtabs(...) > 0          100   0.120 2.264151  
+2

Your use of the function %in%seems to me the opposite. And if you want to get a true / false result for each row of data, you should use either% in% as a vector operation, or ave. Although it is not needed here, you can use it if there is a more complex function that should be applied to each element.

set.seed(123)

dd <- data.frame(
  id  = sample( 1:5, size=10 , replace=T ) ,
  prd = letters[sample( 1:5, size=10 , replace=T )]
  )

prds <- unique(dd$prd)
target.prds <- prds[1:2]
dd$prd.in.trgt <- with( dd, prd %in% target.prds)
+1
source

All Articles