This is possible without writing a function, though it does require lapply . If var is a factor, you can work from its levels: lapply over the levels of var to build one indicator column per level, name the columns with setNames , convert the list to a tbl_df , and bind_cols the result back onto the data.
df %>% bind_cols(as_data_frame(setNames(lapply(levels(df$var), function(x){as.integer(df$var == x)}), paste0('var2_', levels(df$var)))))
returns
Source: local data frame [10 x 5]

      var var_d var_c var2_c var2_d
   (fctr) (dbl) (dbl)  (int)  (int)
1       d     1     0      0      1
2       c     0     1      1      0
3       c     0     1      1      0
4       c     0     1      1      0
5       d     1     0      0      1
6       d     1     0      0      1
7       c     0     1      1      0
8       c     0     1      1      0
9       d     1     0      0      1
10      c     0     1      1      0
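For reference, a minimal df matching the output above could be built like this (a sketch; the var values are read off the printed output, and only the var column is actually needed by the approaches in this answer):

```r
library(dplyr)

# Hypothetical construction of the example data. `var_d` and `var_c`
# mirror the pre-existing dummy columns shown in the printed output.
df <- data_frame(
  var = factor(c('d', 'c', 'c', 'c', 'd', 'd', 'c', 'c', 'd', 'c'))
) %>%
  mutate(var_d = as.numeric(var == 'd'),
         var_c = as.numeric(var == 'c'))
```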
If var is a character vector rather than a factor, you can do the same thing, but using unique instead of levels :
df %>% bind_cols(as_data_frame(setNames(lapply(unique(df$var), function(x){as.integer(df$var == x)}), paste0('var2_', unique(df$var)))))
Two notes:
- This approach will work regardless of the data type, but will be slower. If your data is big enough for that to matter, it probably makes sense to store it as a factor anyway, since it contains many repeated values.
- Both versions pull df$var from the calling environment, not as it may exist within a larger chain, and assume that var does not change in whatever is piped in. To refer to the dynamic value of var you would need non-standard evaluation, which outside of dplyr is rather a pain, in my experience.
Another alternative that is a bit simpler and is factor-agnostic uses reshape2::dcast :
library(reshape2)
df %>% cbind(1 * !is.na(dcast(df, seq_along(var) ~ var, value.var = 'var')[,-1]))
It still pulls the version of df from the calling environment, so the chain really only determines what you are joining to. Since cbind is used instead of bind_cols , the result will be a data.frame , not a tbl_df , so if you want to keep everything tbl_df (sensible if the data is big) you need to replace cbind with bind_cols(as_data_frame( ... )) ; bind_cols doesn't seem to want to do the conversion for you.
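A sketch of that substitution, using a small stand-in df with a factor column var (the data here is illustrative only):

```r
library(dplyr)
library(reshape2)

# Toy data standing in for the df used throughout the answer
df <- data_frame(var = factor(c('d', 'c', 'c', 'd')))

# Wrapping the dcast result in as_data_frame keeps everything tbl_df;
# bind_cols alone won't convert the numeric matrix for you
result <- df %>%
  bind_cols(as_data_frame(
    1 * !is.na(dcast(df, seq_along(var) ~ var, value.var = 'var')[, -1])
  ))
```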
Note, however, that while this version is simpler, it is comparatively slow. On the factor data:
Unit: microseconds
   expr      min        lq      mean    median       uq      max neval
 factor  358.889  384.0010  479.5746  427.9685  501.580 3995.951   100
 unique  547.249  585.4205  696.4709  633.4215  696.402 4528.099   100
  dcast 2265.517 2490.5955 2721.1118 2628.0730 2824.949 3928.796   100
and on the string data:
Unit: microseconds
   expr      min       lq      mean    median        uq      max neval
 unique  307.190  336.422  414.1031  362.6485  419.3625 3693.340   100
  dcast 2117.807 2249.077 2517.0417 2402.4285 2615.7290 3793.178   100
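Timings like these come from microbenchmark; a sketch of how they could be reproduced (the toy data and size here are assumptions, so the absolute numbers will differ, but the ranking should hold):

```r
library(dplyr)
library(reshape2)
library(microbenchmark)

# Toy data: a repeated two-level column, as factor and as character
df     <- data_frame(var = factor(sample(c('c', 'd'), 100, replace = TRUE)))
df_chr <- df %>% mutate(var = as.character(var))

res <- microbenchmark(
  factor = df %>% bind_cols(as_data_frame(setNames(
    lapply(levels(df$var), function(x) as.integer(df$var == x)),
    paste0('var2_', levels(df$var))))),
  unique = df_chr %>% bind_cols(as_data_frame(setNames(
    lapply(unique(df_chr$var), function(x) as.integer(df_chr$var == x)),
    paste0('var2_', unique(df_chr$var))))),
  dcast = df %>% cbind(1 * !is.na(
    dcast(df, seq_along(var) ~ var, value.var = 'var')[, -1]))
)
print(res)
```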
For small data this does not matter, but for big data the extra complexity of the lapply approach may be worthwhile.