I have a dataset with longitudinal data in a human-oriented (wide) format, like this:
pid  varA_1  varB_1  varA_2  varB_2  varA_3  varB_3  ...
1    1       1       0       3       2       1
2    0       1       0       2       2       1
...
50k  1       0       1       3       1       0
This results in a large data frame with at least 50k observations and 90 variables measured over up to 29 periods.
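For concreteness, here is a hypothetical toy version of such data (names and sizes are illustrative; three stubs stand in for the ~90 real variables):

set.seed(1)
n_pid <- 50000L
n_per <- 29L
stubs <- c("varA", "varB", "varC")

wide <- data.frame(pid = seq_len(n_pid))
for (s in stubs) {
  for (t in seq_len(n_per)) {
    # one column per variable/period combination, e.g. varA_17
    wide[[paste(s, t, sep = "_")]] <- sample(0:3, n_pid, replace = TRUE)
  }
}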
I would like to get a more period-oriented format:
pid  index  start  stop  varA  varB  varC  ...
1    1      ...
1    2      ...
1    29     ...           2     1
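To illustrate the transformation I mean, here is a minimal sketch with base R's reshape(), assuming the stub_period column naming above (start and stop would have to be derived separately, so I leave them out):

# split column names like varA_17 into stub varA and period 17
long <- reshape(
  wide,
  direction = "long",
  idvar     = "pid",
  varying   = setdiff(names(wide), "pid"),
  sep       = "_",
  timevar   = "index"
)
long <- long[order(long$pid, long$index), ]

This gives the right shape on the toy data, but it is in the same family as the approaches that were too slow for me.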
I tried different approaches to reshaping the data (*apply, plyr, reshape2, loops, adding up all the numeric matrices, etc.), but I cannot get a decent processing time (40+ minutes even for subsets). I have picked up various hints along the way about what to avoid, but I'm still not sure whether I'm missing a bottleneck or a possible speed-up.
Is there an optimal way of approaching this kind of data processing, so that I can estimate the best-case processing time achievable in pure R code? There are similar questions on Stack Overflow, but they did not lead to convincing answers.
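For reference, one direction not in my list above is data.table's melt with multiple measure patterns; a minimal sketch on the toy data (again assuming the same stub_period naming):

library(data.table)

dt <- as.data.table(wide)
long <- melt(
  dt,
  id.vars       = "pid",
  measure.vars  = patterns("^varA_", "^varB_", "^varC_"),
  variable.name = "index",
  value.name    = c("varA", "varB", "varC")
)
setorder(long, pid, index)   # one row per person/period

Is this (or something like it) the kind of approach that gets the processing time down from tens of minutes to something reasonable?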