Efficient way to set multiple indicator variables per row?

Question


Given the "empty" indicator frame:

    Index Ind_A Ind_B
        1     0     0
        2     0     0
        3     0     0
        4     0     0

and a data frame of values:

    Index Indicators
        1      Ind_A
        3      Ind_A
        3      Ind_B
        4      Ind_A

I want to end up with:

    Index Ind_A Ind_B
        1     1     0
        2     0     0
        3     1     1
        4     1     0

Is there a way to do this without a for loop?

+5

r dataframe

lapolonio May 13, '15 at 15:07


3 answers

bgoldst · Answer 1 · 2015-05-13T15:17:18+0000

    indicator <- data.frame(Index=1:4, Ind_A=rep(0,4), Ind_B=rep(0,4))
    values <- data.frame(Index=c(1,3,3,4), Indicators=c('Ind_A','Ind_A','Ind_B','Ind_A'))
    # Index with a two-column matrix: column 1 = row position, column 2 = column position
    indicator[cbind(match(values$Index, indicator$Index),
                    match(values$Indicators, names(indicator)))] <- 1
    indicator
    ##   Index Ind_A Ind_B
    ## 1     1     1     0
    ## 2     2     0     0
    ## 3     3     1     1
    ## 4     4     1     0
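The assignment above relies on R's two-column matrix indexing, where each row of the index matrix names one (row, column) cell. A minimal sketch on a plain matrix, using a hypothetical toy object `m` that is not part of the original answer:

```r
# Hypothetical toy matrix, only to illustrate two-column matrix indexing
m <- matrix(0, nrow = 4, ncol = 2,
            dimnames = list(as.character(1:4), c("Ind_A", "Ind_B")))

# Each row of idx is one (row, column) pair to set
idx <- cbind(c(1, 3, 3, 4),   # row positions
             c(1, 1, 2, 1))   # column positions
m[idx] <- 1
m
```

All four cells are set in a single vectorised assignment, which is what lets the answer avoid a for loop.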

The most significant change in your edit is that indicator$Index no longer contains unique values (at least not by itself), so a simple match() from values$Index to indicator$Index is not enough. Instead, we can run an outer() equality test on both Index and Index2 to get a logical matrix marking which rows of indicator correspond to each row of values on both keys. Assuming the two-column composite key is unique, we can compute the row index into indicator from the linear (vector) index returned by which().

    indicator[cbind(
        (which(outer(values$Index, indicator$Index, `==`) &
               outer(values$Index2, indicator$Index2, `==`)) - 1) %/% nrow(values) + 1,
        match(values$Indicators, names(indicator))
    )] <- 1
    indicator
    ##   Index Index2 Ind_A Ind_B
    ## 1     1     10     1     1
    ## 2     1     11     1     0
    ## 3     2     10     0     1
    ## 4     2     12     1     0
    ## 5     3     10     1     0
    ## 6     3     12     1     0
    ## 7     4     10     1     1
    ## 8     4     12     1     0
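One point worth checking when reusing this: which() returns linear indices in column-major order, so the computed rows come out ordered by indicator row, while match(values$Indicators, ...) is in values order. The two line up only if values is sorted the same way as the matching rows of indicator. The sketch below is a variant, not the answer's own code, that uses which(arr.ind = TRUE) to keep each hit paired with its values row. The values frame here is hypothetical and deliberately unsorted; the edited question's actual data is not shown in this thread.

```r
indicator <- data.frame(Index  = c(1, 1, 2, 2, 3, 3, 4, 4),
                        Index2 = c(10, 11, 10, 12, 10, 12, 10, 12),
                        Ind_A  = 0, Ind_B = 0)

# Hypothetical, deliberately unsorted values (an assumption for illustration)
values <- data.frame(Index      = c(3, 1, 1),
                     Index2     = c(12, 11, 10),
                     Indicators = c("Ind_B", "Ind_A", "Ind_A"))

# Logical matrix: rows = rows of values, columns = rows of indicator
hit <- outer(values$Index,  indicator$Index,  `==`) &
       outer(values$Index2, indicator$Index2, `==`)

# arr.ind = TRUE returns both positions: "row" (values) and "col" (indicator)
pos <- which(hit, arr.ind = TRUE)

indicator[cbind(pos[, "col"],
                match(values$Indicators[pos[, "row"]], names(indicator)))] <- 1
indicator
```

Here the key (3, 12) is flagged as Ind_B and the two Index 1 keys as Ind_A, regardless of the order of rows in values.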

Here is another solution using merge():

    indicator[cbind(
        merge(values, cbind(indicator, row=1:nrow(indicator)))$row,
        match(values$Indicators, names(indicator))
    )] <- 1
    indicator
    ##   Index Index2 Ind_A Ind_B
    ## 1     1     10     1     1
    ## 2     1     11     1     0
    ## 3     2     10     0     1
    ## 4     2     12     1     0
    ## 5     3     10     1     0
    ## 6     3     12     1     0
    ## 7     4     10     1     1
    ## 8     4     12     1     0

Performance

The first solution (outer()) is considerably faster than the merge() one:

    first <- function() indicator[cbind(
        (which(outer(values$Index, indicator$Index, `==`) &
               outer(values$Index2, indicator$Index2, `==`)) - 1) %/% nrow(values) + 1,
        match(values$Indicators, names(indicator))
    )] <<- 1
    second <- function() indicator[cbind(
        merge(values, cbind(indicator, row=1:nrow(indicator)))$row,
        match(values$Indicators, names(indicator))
    )] <<- 1
    N <- 10000
    system.time({ replicate(N, first()) })
    ##    user  system elapsed
    ##   2.032   0.000   2.041
    system.time({ replicate(N, first()) })
    ##    user  system elapsed
    ##   2.047   0.000   2.038
    system.time({ replicate(N, second()) })
    ##    user  system elapsed
    ##  12.578   0.000  12.592
    system.time({ replicate(N, second()) })
    ##    user  system elapsed
    ##   12.64    0.00   12.66

Frank · Answer 2 · 2015-05-13T15:30:18+0000

I would use matrices:

    ind_mat <- as.matrix(ind_df[,-1]); rownames(ind_mat) <- ind_df[,1]
    val_mat <- cbind(match(val_df$Index, ind_df[,1]),
                     match(val_df$Indicators, names(ind_df[-1])))
    ind_mat[val_mat] <- 1L
    #   Ind_A Ind_B
    # 1     1     0
    # 2     0     0
    # 3     1     1
    # 4     1     0

You probably don't need "Index" as a column; you can just store it as the rownames. If (i) your matrix of values is small relative to the indicator matrix, and (ii) your index column is simply 1:nrow(ind_df), you should consider storing the result in a sparse matrix.
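A minimal sketch of the sparse-matrix idea, assuming the Matrix package (a recommended package that ships with standard R installations) and the question's data. The answer does not name a package, so sparseMatrix() is one possible choice rather than the author's stated method; ind_names is a helper name introduced here.

```r
library(Matrix)

val_df <- data.frame(Index = c(1, 3, 3, 4),
                     Indicators = c("Ind_A", "Ind_A", "Ind_B", "Ind_A"))
ind_names <- c("Ind_A", "Ind_B")   # helper name introduced for this sketch

# Only the non-zero cells are stored; i and j give their row/column positions
ind_sp <- sparseMatrix(i = val_df$Index,
                       j = match(val_df$Indicators, ind_names),
                       x = 1,
                       dims = c(4, length(ind_names)),
                       dimnames = list(NULL, ind_names))
ind_sp
```

Note that sparseMatrix() sums the x values of repeated (i, j) pairs, so duplicated rows in val_df would give counts rather than 0/1 flags.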

As for coercing to a matrix, it takes very little time, and it saves you the hassle of coercing later for every matrix operation. Here is an example:

    n = 1e4
    nind = 1e3
    y <- rnorm(n)
    x <- matrix(sample(0:1, size=n*nind, replace=TRUE), ncol=nind)
    xd <- data.frame(1:nrow(x), x)

    # timing: 0.04 seconds on my computer
    system.time(as.matrix(xd[,-1]))

    # messiness, eg, for OLS y~0+x: immense
    solve(t(as.matrix(xd[,-1]))%*%as.matrix(xd[,-1]))%*%(t(as.matrix(xd[,-1]))%*%y)

The last line shows what avoiding a stored matrix looks like: the data frame has to be coerced again at every use. I do not see the point of that.
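For contrast, here is a sketch of the same OLS computation when the indicators are already stored as a matrix. The sizes are smaller than in the answer so that it runs quickly, the seed is an arbitrary choice for reproducibility, and crossprod() is used here in place of the explicit t(x) %*% x.

```r
set.seed(1)               # arbitrary seed, only for reproducibility
n <- 1000; nind <- 20     # smaller than the answer's sizes (assumption)
y <- rnorm(n)
x <- matrix(sample(0:1, size = n * nind, replace = TRUE), ncol = nind)

# OLS for y ~ 0 + x via the normal equations, using the stored matrix directly
beta <- solve(crossprod(x), crossprod(x, y))

# Cross-check against lm(), which accepts the matrix as-is
fit <- lm(y ~ 0 + x)
all.equal(as.vector(beta), unname(coef(fit)))
```

No as.matrix() call is needed anywhere, which is the convenience the answer is arguing for.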

Colonel beauvel · Answer 3 · 2015-05-13T15:44:42+0000

I would do it directly with table():

    # Make Index a factor over the full range so missing indices (here 2) get a row
    df = transform(df, Index=factor(Index, levels=min(Index):max(Index)))
    as.data.frame.matrix(table(df))
    #  Ind_A Ind_B
    #1     1     0
    #2     0     0
    #3     1     1
    #4     1     0

Data:

    df = structure(list(Index = c(1, 3, 3, 4),
                        Indicators = c("Ind_A", "Ind_A", "Ind_B", "Ind_A")),
                   .Names = c("Index", "Indicators"),
                   row.names = c(NA, -4L), class = "data.frame")
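One caveat with the table() approach: it counts occurrences, so a repeated (Index, Indicators) pair yields a value above 1. A minimal sketch of collapsing the counts back to 0/1 flags, using hypothetical data in which the pair (3, "Ind_A") appears twice; the names counts and flags are introduced here for illustration.

```r
# Hypothetical data with the pair (3, "Ind_A") repeated
df <- data.frame(Index = c(1, 3, 3, 3, 4),
                 Indicators = c("Ind_A", "Ind_A", "Ind_A", "Ind_B", "Ind_A"))
df <- transform(df, Index = factor(Index, levels = min(Index):max(Index)))

counts <- as.data.frame.matrix(table(df))  # counts: row "3" has Ind_A = 2
flags  <- as.data.frame(+(counts > 0))     # collapse counts to 0/1 indicators
flags
```

If the input is known to contain no duplicate pairs, the extra step is unnecessary and the answer's two lines suffice.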
