A little profiling shows that most of your time is spent in [<-.data.frame .
Therefore, scaling problems arise from the way Ops.data.frame and [<-.dataframe , and how [<-.data.frame copies, and [[<- copies for a named list.
Relevant code in Ops.data.frame (with my comments)
# cn is the names of your data.frame for (j in seq_along(cn)) { left <- if (!lscalar) e1[[j]] else e1 right <- if (!rscalar) e2[[j]] else e2 value[[j]] <- eval(f) } # sometimes return a data.frame if (.Generic %in% c("+", "-", "*", "/", "%%", "%/%")) { names(value) <- cn data.frame(value, row.names = rn, check.names = FALSE, check.rows = FALSE) } # sometimes return a matrix else matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, dimnames = list(rn, cn))
When you use Ops.data.frame , it will cycle through the columns in the for loop, using [[<- to replace each time. This means that as the number of columns increases, the increasing time will increase (since there will be some kind of protective internal copying, since this data.frame is called a list) - therefore, it will scale linearly with the number of columns
# for example only this part will scale with the number of columns f.df.1 <- function( df , x = 0.5 ){ df <- df - x return( df ) } microbenchmark(f.df.1(df1),f.df.1(df2),f.df.1(df3), times = 10L)
[<-.data.frame has a similar loop through columns, when i is a logical matrix of the same dimension as x
if(is.logical(i) && is.matrix(i) && all(dim(i) == dim(x))) { nreplace <- sum(i, na.rm=TRUE) if(!nreplace) return(x) # nothing to replace ## allow replication of length(value) > 1 in 1.8.0 N <- length(value) if(N > 1L && N < nreplace && (nreplace %% N) == 0L) value <- rep(value, length.out = nreplace) if(N > 1L && (length(value) != nreplace)) stop("'value' is the wrong length") n <- 0L nv <- nrow(x) for(v in seq_len(dim(i)[2L])) { thisvar <- i[, v, drop = TRUE] nv <- sum(thisvar, na.rm = TRUE) if(nv) { if(is.matrix(x[[v]])) x[[v]][thisvar, ] <- if(N > 1L) value[n+seq_len(nv)] else value else x[[v]][thisvar] <- if(N > 1L) value[n+seq_len(nv)] else value } n <- n+nv } return(x) f.df.2 <- function( df , x = 0.5 ){ df[df < 0 ] <- 0 return( df ) } microbenchmark(f.df.2(df1), f.df.2(df2), f.df.2(df3), times = 10L) # Unit: milliseconds # expr min lq median uq max neval # f.df.2(df1) 20.500873 20.575801 20.699469 20.993723 84.825607 10 # f.df.2(df2) 3.143228 3.149111 3.173265 3.353779 3.409068 10 # f.df.2(df3) 1.581727 1.634463 1.707337 1.876240 1.887746 10
[<- data.frame (and <- ) will also copy
How to improve. You can use lapply or set from the data.table package
library(data.table) sdf <- function(df, x = 0.5){