I have a dataset with longitudinal data in a human-oriented (wide) format, like this:
pid  varA_1  varB_1  varA_2  varB_2  varA_3  varB_3  ...
1    1       1       0       3       2       1
2    0       1       0       2       2       1
...
50k  1       0       1       3       1       0
This results in a large data frame with at least 50k observations and 90 variables measured over up to 29 periods.
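For concreteness, here is a hypothetical toy version of such data (names and sizes are illustrative; three stubs stand in for the ~90 real variables):

set.seed(1)
n_pid <- 50000L
n_per <- 29L
stubs <- c("varA", "varB", "varC")

wide <- data.frame(pid = seq_len(n_pid))
for (s in stubs) {
  for (t in seq_len(n_per)) {
    # one column per variable/period combination, e.g. varA_17
    wide[[paste(s, t, sep = "_")]] <- sample(0:3, n_pid, replace = TRUE)
  }
}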
I would like to get a more period-oriented format:
pid  index  start  stop  varA  varB  varC  ...
1    1      ...
1    2      ...
1    29     ...           2     1
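To illustrate the transformation I mean, here is a minimal sketch with base R's reshape(), assuming the stub_period column naming above (start and stop would have to be derived separately, so I leave them out):

# split column names like varA_17 into stub varA and period 17
long <- reshape(
  wide,
  direction = "long",
  idvar     = "pid",
  varying   = setdiff(names(wide), "pid"),
  sep       = "_",
  timevar   = "index"
)
long <- long[order(long$pid, long$index), ]

This gives the right shape on the toy data, but it is in the same family as the approaches that were too slow for me.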
I tried different approaches to reshaping the data (*apply, plyr, reshape2, loops, adding up all the numeric matrices, etc.), but I cannot get a decent processing time (40+ minutes even for subsets). I have picked up various hints along the way about what to avoid, but I'm still not sure whether I'm missing a bottleneck or a possible speed-up.
Is there an optimal way of approaching this kind of data processing, so that I can estimate the best-case processing time achievable in pure R code? There are similar questions on Stack Overflow, but they did not lead to convincing answers.
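For reference, one direction not in my list above is data.table's melt with multiple measure patterns; a minimal sketch on the toy data (again assuming the same stub_period naming):

library(data.table)

dt <- as.data.table(wide)
long <- melt(
  dt,
  id.vars       = "pid",
  measure.vars  = patterns("^varA_", "^varB_", "^varC_"),
  variable.name = "index",
  value.name    = c("varA", "varB", "varC")
)
setorder(long, pid, index)   # one row per person/period

Is this (or something like it) the kind of approach that gets the processing time down from tens of minutes to something reasonable?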