Why is sapply much slower to process rows than columns in a dataframe in R?

Consider the following script, which we will call Foo.r

    set.seed(1)
    x = matrix(rnorm(1000 * 1000), ncol = 1000)
    x = data.frame(x)
    dummy = sapply(1:1000, function(i) sum(x[i, ]))   # sum the rows
    #dummy = sapply(1:1000, function(i) sum(x[, i]))  # sum the columns

When the first dummy line is commented out and the second uncommented, we sum the columns, and the code takes less than a second to run on my machine.

    $ time Rscript Foo.r

    real    0m0.766s
    user    0m0.536s
    sys     0m0.080s

When instead the second dummy line is commented out (and the first left uncommented, as shown), we sum the rows, and the runtime approaches 30 seconds.

    $ time Rscript Foo.r

    real    0m30.589s
    user    0m30.248s
    sys     0m0.104s

Note that I am aware of the standard summation functions rowSums and colSums; I use sum here only as an example of this strange asymmetric performance behavior.
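For reference, the dedicated functions show no such asymmetry, since both coerce the data frame to a matrix internally; a quick check (timings are illustrative and will vary by machine):

    system.time(dummy <- rowSums(x))  # fast, despite operating on rows
    system.time(dummy <- colSums(x))  # comparably fast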

1 answer

This is not actually an effect of sapply; rather, it comes down to how data frames are stored, and what that implies for retrieving rows compared to columns. A data frame is stored as a list in which each list element is a column.

This means that extracting a column is far cheaper than extracting a row: a column is simply one element of the list, whereas a row must be assembled by indexing into every column.
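You can see this structure directly on the x data frame from the question:

    is.list(x)                   # TRUE: a data frame is a list of columns
    length(x)                    # 1000, one list element per column
    identical(x[, 10], x[[10]])  # TRUE: a column is just one list element
    class(x[1, ])                # "data.frame": a row is rebuilt from a
                                 # piece of every one of the 1000 columns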

To demonstrate that this has nothing to do with sapply, consider these timings of plain loops over the same data frame x:

    foo1 <- function() {
      for (i in 1:1000) {
        tmp <- x[i, ]   # extract one row
      }
    }

    foo2 <- function() {
      for (i in 1:1000) {
        tmp <- x[, i]   # extract one column
      }
    }

    system.time(foo2())
    #   user  system elapsed
    #  0.029   0.000   0.031
    system.time(foo1())
    #   user  system elapsed
    # 15.986   0.074  15.894

If you need to work across rows and columns with equal speed, a data frame is often a poor choice. To work on a row, you have to extract the corresponding element from every column in the list; to work on a column, you only need to pull out that single list element. A matrix, which stores its data in one contiguous vector and treats both dimensions symmetrically, is usually the better structure for that.
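As a rough illustration of that last point (not a timing from the original answer), repeating the row-extraction loop on a plain matrix should run in well under a second:

    xm <- as.matrix(x)   # same data, matrix storage

    foo3 <- function() {
      for (i in 1:1000) {
        tmp <- xm[i, ]   # row extraction from a matrix
      }
    }

    system.time(foo3())  # expect times close to foo2(), not foo1()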
