Preliminaries: this question is mostly of educational value; the urgent task is already done, even if the approach was not entirely optimal. My question is whether the code below can be made faster and/or implemented more elegantly. Additional packages such as plyr or reshape are welcome. The actual data takes about 140 seconds to process, much longer than the simulated data (about 30 seconds), because some of the real rows contain nothing but NAs, which requires additional checks.
Conditions: the dataset contains 360 variables, i.e. 30 sets of 12. Name them V1_1, V1_2, ... (first set), V2_1, V2_2, ... (second set), and so on. Each set of 12 variables holds dichotomous (yes/no) answers, in practice corresponding to a career status: for example, work (yes/no), research (yes/no), etc. There are only 12 statuses, repeated 30 times.
Task: recode each set of 12 dichotomous variables into a single variable with 12 answer categories (for example, work, research, ...). The result should be 30 variables, each with 12 answer categories.
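As a toy illustration of the mapping for a single respondent and a single set (the values are invented): the position of the lone "1" among the 12 dummies becomes the category code.

```r
# one respondent's answers for one set of 12 statuses: a single "yes"
x <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
which(x == 1)  # 3, i.e. the third status (e.g. "research")
```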
Data : I can't post the actual dataset, but here is a good simulated approximation:
randomRow <- function() {
  sample(c(1, rep(0, 11)))  # one set of 12 dummies with a single "1" in a random slot
}
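The rest of the generator was cut off above, so here is a self-contained sketch of one way to build a matching mydata; the seed, the 1000 rows, the helper name one_respondent, and the V&lt;set&gt;_&lt;item&gt; naming are my assumptions, chosen to match the description.

```r
set.seed(888)  # assumed seed, for reproducibility only

# each respondent: 30 sets of 12 dummies, one "1" per set at a random slot
# (one_respondent is a hypothetical helper, not from the original post)
one_respondent <- function() {
  unlist(lapply(1:30, function(i) sample(c(1, rep(0, 11)))))
}

mydata <- as.data.frame(t(replicate(1000, one_respondent())))
names(mydata) <- paste0("V", rep(1:30, each = 12), "_", rep(1:12, times = 30))
dim(mydata)  # 1000 rows, 360 columns
```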
My solution:
# Divide the dataset into a list of 30 data frames, each with 12 variables
S1 <- lapply(1:30, function(i) {
  Z <- rep(1:30, each = 12)  # define selection vector
  mydata[Z == i]             # use selection vector to get groups of 12 variables
})

recodeDf <- function(df) {
  result <- as.numeric(apply(df, 1, function(x) {
    if (any(!is.na(x))) which(x == 1) else NA  # return the position of the "1" per row
  }))  # the if/else check is for the real data (all-NA rows)
  return(result)
}

# Combine the individual position vectors into a data frame
final.df <- as.data.frame(do.call(cbind, lapply(S1, recodeDf)))
Overall there are two nested *apply calls: one over the list of data frames, the other over the rows of each data frame. This makes it a little slow. Any suggestions? Thanks in advance.
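For comparison, one direction that avoids the row-wise apply is a single vectorized pass per set using max.col, which finds the column of the maximum in each row. This is only a sketch on toy data (2 sets of 3 dummies instead of 30 sets of 12; recodeSet is a name I made up), not a benchmarked replacement; all-NA rows are mapped to NA explicitly, matching the check in the question.

```r
# toy data: 2 sets of 3 dummies, 4 rows; row 4 of set 1 is all NA
m <- data.frame(V1_1 = c(1, 0, 0, NA), V1_2 = c(0, 1, 0, NA), V1_3 = c(0, 0, 1, NA),
                V2_1 = c(0, 1, 0, 1),  V2_2 = c(1, 0, 0, 0),  V2_3 = c(0, 0, 1, 0))

recodeSet <- function(df) {
  pos <- max.col(replace(df, is.na(df), 0), ties.method = "first")  # column of the "1"
  pos[rowSums(!is.na(df)) == 0] <- NA                               # all-NA rows stay NA
  pos
}

sets  <- split.default(m, rep(1:2, each = 3))  # split the columns into sets
final <- as.data.frame(lapply(sets, recodeSet))
```

split.default splits the data frame column-wise (split on a data frame would split rows), so each element of sets is one 3-column block, directly analogous to the S1 list above.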