Optimization: splitting a data frame into a list of data frames, transforming data per row

Preliminaries: this question is mainly of educational value; the task at hand has been completed, even if the approach is not entirely optimal. My question is whether the code below can be optimized for speed and/or implemented more elegantly. Additional packages such as plyr or reshape are welcome. The actual data takes about 140 seconds, much longer than the simulated data below, because some of the original rows contain nothing but NA and additional checks have to be performed. For comparison, the simulated data is processed in about 30 seconds.

Conditions: the dataset contains 360 variables, 30 times a set of 12. Name them V1_1, V1_2 ... (first set), V2_1, V2_2 ... (second set), and so on. Each set of 12 variables contains dichotomous (yes/no) answers, in practice corresponding to a career status, for example: work (yes/no), research (yes/no), etc. There are only 12 statuses, repeated 30 times.

Task: the task is to recode each set of 12 dichotomous variables into one variable with 12 response categories (for example, work, research ...). In the end we should get 30 variables, each with 12 response categories.
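
To make the task concrete, here is a small illustration with hypothetical values: a row of one set of 12 dichotomous answers whose 1 sits in position 3 should be recoded to the single value 3.

    # Hypothetical illustration of the recoding for one respondent and one set of 12
    first.set <- c(0, 0, 1, 0, 0, NA, 0, 0, 0, 0, 0, 0)
    which(first.set == 1)  # [1] 3  -> e.g. the category "research"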

Data: I can't post the actual dataset, but here is a good simulated approximation:

    randomRow <- function() {
      # make a row with a single 1 and some NA's
      sample(x = c(rep(0, 9), 1, NA, NA), size = 12, replace = FALSE)
    }

    # create a data frame with 12 variables and 1500 cases
    makeDf <- function() {
      data <- matrix(NA, ncol = 12, nrow = 1500)
      for (i in 1:1500) {
        data[i, ] <- randomRow()
      }
      return(data)
    }

    # combine 30 of these data frames horizontally
    mydata <- NULL
    for (i in 1:30) {
      mydata <- cbind(mydata, makeDf())
    }
    mydata <- as.data.frame(mydata)  # example data ready

My solution:

    # Divide the dataset into a list of 30 data frames, each with 12 variables
    S1 <- lapply(1:30, function(i) {
      Z <- rep(1:30, each = 12)  # define selection vector
      mydata[Z == i]             # use selection vector to get groups of 12 variables
    })

    recodeDf <- function(df) {
      result <- as.numeric(apply(df, 1, function(x) {
        # return the position of the "1" per row;
        # the if/else check is for the real data (rows that are all NA)
        if (any(!is.na(x))) which(x == 1) else NA
      }))
      return(result)
    }

    # Combine the individual position vectors into a data frame
    final.df <- as.data.frame(do.call(cbind, lapply(S1, recodeDf)))

In essence, there is a double *apply: one over the list, the other over the rows of the data. This makes it a bit slow. Any suggestions? Thanks in advance.

4 answers

I really like the @Arun matrix multiplication idea. Interestingly, if you build R against an optimized BLAS such as OpenBLAS, you can even get the matrix multiplication to run in parallel.

However, I wanted to offer another solution that follows your original pattern; it is probably slower than the matrix multiplication, but much faster than your implementation:

    # match is usually faster than which, because it only returns the first match
    # (and therefore won't fail on multiple matches).
    # It also neatly handles your *all NA* case.
    recodeDf2 <- function(df) apply(df, 1, match, x = 1)

    # You can split your data.frame into columns with split.default
    # (using split on a data.frame would split by rows)
    S2 <- split.default(mydata, rep(1:30, each = 12))
    final.df2 <- lapply(S2, recodeDf2)
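
As a quick sanity check (a sketch, assuming final.df from the question and the objects above exist), wrap the lapply result in a data.frame before comparing:

    # Sketch of a sanity check; assumes final.df (from the question) and S2/recodeDf2 exist.
    # na.rm = TRUE covers the all-NA rows that may occur in the real data.
    final.df2 <- data.frame(lapply(S2, recodeDf2))
    all(unlist(final.df) == unlist(final.df2), na.rm = TRUE)  # should be TRUE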

If you have a very large data frame and many cores, you could consider parallelizing this operation with:

    library(parallel)
    final.df2 <- mclapply(S2, recodeDf2, mc.cores = numcores)
    # where numcores is your number of processor cores
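
numcores above is a placeholder; one common choice (sketched below) is to ask the parallel package for the machine's core count. Note that mclapply relies on forking, so on Windows mc.cores must stay at 1.

    # Hedged sketch: numcores is not defined anywhere above, so supply it yourself
    numcores <- parallel::detectCores()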

After reading the answers from @Arun and @mnel, I learned a lot about how to improve this function: avoid the coercion to a matrix and process the data.frame column by column rather than row by row. I don't mean to "steal" an answer here; the OP should consider switching the check mark to @mnel's answer.

I did want to share a solution that does not use data.table and avoids for loops. It is still slightly slower than @mnel's solution, though.

    nograpes2 <- function(mydata) {
      test <- function(df) {
        # for each of the 12 columns, the row positions that contain a 1
        l <- lapply(df, function(x) which(x == 1))
        lens <- sapply(l, length)
        # repeat each column index as often as it had a 1, then reorder by row
        rep.int(seq_along(l), times = lens)[order(unlist(l))]
      }
      S2 <- split.default(mydata, rep(1:30, each = 12))
      data.frame(lapply(S2, test))
    }
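
For completeness, a usage sketch (assuming mydata from the simulated example above):

    # Usage sketch; assumes mydata from the simulated example above
    final.df3 <- nograpes2(mydata)
    system.time(nograpes2(mydata))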

I would also add that @Aaron's approach, using which with arr.ind=TRUE, would also be very fast and elegant if mydata started out as a matrix rather than a data.frame. Coercing to a matrix is slower than the rest of the function. If speed were an issue, it would be worth reading the data in as a matrix in the first place.
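
A rough, hedged timing sketch of that coercion cost (assuming mydata from the simulated example above):

    # The as.matrix coercion is the expensive part; on a matrix,
    # which(..., arr.ind = TRUE) itself is very cheap.
    system.time(m <- as.matrix(mydata))
    system.time(which(m == 1, arr.ind = TRUE))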


Here is an approach that is basically instantaneous (system.time = 0.1 seconds).

It uses set from data.table. The columnMatch component will depend on your data, but if it is every 12 columns, then the following will work.

    library(data.table)

    MYD <- data.table(mydata)

    # a new data.table (changed to numeric : Arun)
    newDT <- as.data.table(replicate(30, numeric(nrow(MYD)), simplify = FALSE))

    # for each column, which values equal 1
    whiches <- lapply(MYD, function(x) which(x == 1))

    # create a list of column matches (those you wish to aggregate)
    columnMatch <- split(names(mydata), rep(1:30, each = 12))
    setattr(columnMatch, 'names', names(newDT))

    # cycle through all new columns and assign the rows in the new data.table
    ## Arun: had to generate numeric indices for
    ## cycling through 1:12, 13:24 in whiches[[.]]. That was the problem.
    for (jj in seq_along(columnMatch)) {
      for (ii in seq_along(columnMatch[[jj]])) {
        set(newDT, j = jj, i = whiches[[ii + 12 * (jj - 1)]], value = ii)
      }
    }

This would work just as well by adding the columns by reference to the original data.table.

Note that set also works with data.frames ....
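
A quick, hedged check against the OP's final.df (assuming both objects exist from above; rows that never contain a 1 stay 0 in newDT but are NA in final.df, hence na.rm = TRUE):

    # Sketch of a check; assumes final.df (from the question) and newDT exist
    all(unlist(final.df) == unlist(newDT), na.rm = TRUE)  # should be TRUE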


IIUC, you have only one 1 in every set of 12 columns; the rest are 0 or NA. If so, the operation can be performed much faster with this idea.

Idea: instead of going through each row and asking for the position of the 1, you could use a 1500 * 12 matrix where each row is just 1:12. That is:

    mul.mat <- matrix(rep(1:12, nrow(DT)), ncol = 12, byrow = TRUE)

Now you can multiply this matrix with each 1500 * 12 subset of your data.frame and take its rowSums (which is vectorized) with na.rm = TRUE. This directly gives the position in each row where the 1 occurs (because the 1 is multiplied by the corresponding value between 1 and 12).
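
As a minimal base R sketch of this idea on the first block of 12 columns (assuming mydata from the simulated example above):

    # Minimal sketch of the multiplication idea in base R; assumes mydata from above.
    # Each 1 picks out its column index; all-NA rows come out as 0 rather than NA.
    block1  <- as.matrix(mydata[, 1:12])
    mul.mat <- matrix(rep(1:12, nrow(block1)), ncol = 12, byrow = TRUE)
    head(rowSums(block1 * mul.mat, na.rm = TRUE))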


data.table implementation: here I'll use data.table to illustrate the idea. Because it creates the new columns by reference, I expect the same idea applied to a data.frame to be a little slower, although it should still dramatically speed up your current code.

    require(data.table)
    DT <- data.table(mydata)
    ids <- seq(1, ncol(DT), by = 12)

    # for multiplying with each subset and taking rowSums to get the position of the 1
    mul.mat <- matrix(rep(1:12, nrow(DT)), ncol = 12, byrow = TRUE)

    for (i in ids) {
      sdcols <- i:(i + 12 - 1)
      # keep appending the new columns by reference to the original data
      DT[, paste0("R", i %/% 12 + 1) := rowSums(.SD * mul.mat, na.rm = TRUE),
         .SDcols = sdcols]
    }

    # delete all 360 original columns by reference from the original data
    DT[, grep("V", names(DT), value = TRUE) := NULL]

You are now left with 30 columns corresponding to the position of the 1. On my system this takes about 0.4 seconds.

    all(unlist(final.df) == unlist(DT))  # not a fan of `identical`
    # [1] TRUE

Another way this can be done in base R is to simply compute the values you want to put into the new matrix and fill them in directly using matrix indexing.

    idx <- which(mydata == 1, arr.ind = TRUE)  # get indices of the 1's
    i <- (idx[, 2] - 1) %% 12 + 1              # which column within its group of 12 was 1
    idx[, 2] <- (idx[, 2] - 1) %/% 12 + 1      # get the "group" and put it in the "col" of idx
    out <- array(NA, dim = c(1500, 30))        # make an empty matrix
    out[idx] <- i                              # and fill it in!
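
And a quick, hedged check against the OP's final.df (assuming both objects exist from above):

    # Sketch of a check; both objects are in the same column order,
    # so an element-wise comparison suffices.
    all(unlist(final.df) == out, na.rm = TRUE)  # should be TRUE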
