R: generate all possible interaction variables

I have a dataframe with variables, say a, b, c, d

dat <- data.frame(a=runif(1e5), b=runif(1e5), c=runif(1e5), d=runif(1e5)) 

and I would like to generate all possible two-way interaction terms between the columns, namely: ab, ac, ad, bc, bd, cd. In reality my data frame has more than 100 columns, so I can't code them by hand. What is the most efficient way to do this (noting that I don't want both ab and ba)?

2 answers

What do you plan to do with all these interaction terms? There are several options, and which is best depends on what you are trying to do.

If you want to pass the interactions to a modeling function like lm or aov, then it is very simple: just use the .^2 syntax:

 fit <- lm( y ~ .^2, data=mydf ) 

The above calls lm and tells it to fit all the main effects and all two-way interactions for the variables in mydf, excluding y.
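As a small illustration of what .^2 expands to, the sketch below fits the model on a toy data frame and lists the coefficient names. The data frame mydf, its columns, and the response y are made up for this example.

```r
# Toy data: response y plus three predictors (all names are illustrative)
set.seed(1)
mydf <- data.frame(y = rnorm(50), a = runif(50), b = runif(50), c = runif(50))

# .^2 expands to all main effects plus all two-way interactions
fit <- lm(y ~ .^2, data = mydf)

# Interaction terms are named with a colon, e.g. "a:b"
coef_names <- names(coef(fit))
print(coef_names)
```

Each pair appears once (a:b but not b:a), which matches the requirement in the question.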

If for some reason you really want to calculate all interactions, you can use model.matrix :

 tmp <- model.matrix( ~.^2, data=iris) 

This will include a column for the intercept and columns for the main effects, but you can drop those if you don't want them.
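One possible way to keep only the interaction columns is to select the columns whose names contain a colon. This is a minimal sketch that assumes no original column name contains ":" itself. It uses the small dat from the other answer rather than iris.

```r
set.seed(24)
dat <- data.frame(a = runif(5), b = runif(5), c = runif(5), d = runif(5))

# Full design matrix: intercept, main effects, and two-way interactions
mm <- model.matrix(~ .^2, data = dat)

# Interaction columns are the ones whose names contain a colon ("a:b", ...)
inter <- mm[, grepl(":", colnames(mm), fixed = TRUE), drop = FALSE]
colnames(inter)
```

For numeric columns, each retained column is simply the elementwise product of the two variables in its name.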

If you need something other than modeling, you can use the combn function, as @akrun mentions in the comments.


Assuming the expected result is the combinations of column names (a_b, a_c, etc., as shown in the comment below), we can use combn on the column names of the data set and specify m as 2.

 combn(colnames(dat), 2, FUN=paste, collapse='_')
 #[1] "a_b" "a_c" "a_d" "b_c" "b_d" "c_d"

If we need to multiply the columns of 'dat' pairwise, we subset the data set using each combination of column names produced by combn ( dat[,x[1]] , dat[,x[2]] ), multiply ( * ) the two columns, convert the result to a 'data.frame' ( data.frame() ), and set the column name ( setNames ) to the pasted combination of column names. We wrap each result in a list and cbind the list elements using do.call(cbind, ...) .

 do.call(cbind, combn(colnames(dat), 2, FUN = function(x)
   list(setNames(data.frame(dat[, x[1]] * dat[, x[2]]),
                 paste(x, collapse = "_")))))
 #         a_b        a_c        a_d        b_c        b_d        c_d
 #1 0.26929788 0.17697473 0.26453066 0.55676619 0.83221898 0.54691008
 #2 0.06291005 0.08337501 0.04455453 0.10370775 0.05542008 0.07344851
 #3 0.53789990 0.47301970 0.03112880 0.51305076 0.03376319 0.02969076
 #4 0.41596384 0.34920860 0.25992717 0.53948322 0.40155468 0.33711187
 #5 0.16878584 0.21232357 0.09196025 0.08162171 0.03535148 0.04447027
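The same products can also be computed from the column indices in a single matrix multiplication step. This is an alternative sketch, not the method used in the answer above, and it assumes all columns are numeric. The names idx, m, and prods are introduced only for this example.

```r
set.seed(24)
dat <- data.frame(a = runif(5), b = runif(5), c = runif(5), d = runif(5))

# Each column of idx holds the positions of one pair of columns (i < j)
idx <- combn(ncol(dat), 2)

# Multiply all "left" columns by all "right" columns elementwise in one step
m <- as.matrix(dat)
prods <- m[, idx[1, ], drop = FALSE] * m[, idx[2, ], drop = FALSE]

# Name each product column after its pair, e.g. "a_b"
colnames(prods) <- paste(names(dat)[idx[1, ]], names(dat)[idx[2, ]], sep = "_")
head(prods)
```

The result is a numeric matrix rather than a data frame, so wrap it in as.data.frame() if a data frame is needed.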

Benchmarks

 set.seed(494)
 dat <- data.frame(a=runif(1e6), b=runif(1e6), c=runif(1e6), d=runif(1e6))
 greg <- function() model.matrix( ~.^2, data=dat)
 akrun <- function() {
   do.call(cbind, combn(colnames(dat), 2, FUN = function(x)
     list(setNames(data.frame(dat[, x[1]] * dat[, x[2]]),
                   paste(x, collapse = "_")))))
 }
 system.time(greg())
 #   user  system elapsed
 #  1.159   0.024   1.182
 system.time(akrun())
 #   user  system elapsed
 #  0.013   0.000   0.013
 library(microbenchmark)
 microbenchmark(greg(), akrun(), times=20L, unit='relative')
 # Unit: relative
 #     expr      min       lq     mean   median       uq      max neval cld
 #   greg() 39.63122 38.53662 10.23198 18.81274 6.568741 4.642702    20   b
 #  akrun()  1.00000  1.00000  1.00000  1.00000 1.000000 1.000000    20  a

NOTE: Benchmark results may vary with the number of columns and the number of rows. Here I use the number of columns shown in the OP's post.
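To get a rough sense of how the output grows with more columns, the sketch below estimates the number of pairs and the memory needed to store all products as doubles. It assumes exactly 100 columns and 1e5 rows (the question only says "more than 100"), and it ignores any object overhead.

```r
n_cols <- 100   # assumed column count; the question says "more than 100"
n_rows <- 1e5   # number of rows in the question's example

n_pairs <- choose(n_cols, 2)      # number of unordered column pairs
bytes   <- n_rows * n_pairs * 8   # a double takes 8 bytes
gib     <- bytes / 2^30

n_pairs        # 4950
round(gib, 1)  # 3.7
```

At that size the full set of interaction columns may not fit comfortably in memory, which is one reason to let the modeling function handle them where possible.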

data

 set.seed(24)
 dat <- data.frame(a=runif(5), b=runif(5), c=runif(5), d=runif(5))