Using mean() with .SD and .SDcols in data.table

I am writing a very simple function to summarize data.table columns. I pass one column at a time to the function, run some diagnostic operations to work out the generalization options, and then generalize. I am doing this in data.table because I need to handle some very large datasets.

So I use .SDcols to pass in the column to summarize, and apply functions to .SD in the j part of the data.table expression. Since I pass one column at a time, I don't use lapply. What I find is that some functions work while others do not. Below is the test data set I'm working with and the results I see:

    library(data.table)

    dt <- data.table(
      a = 1:10,
      b = as.factor(letters[1:10]),
      c = c(TRUE, FALSE),
      d = runif(10, 0.5, 100),
      e = c(0, 1),
      f = as.integer(c(0, 1)),
      g = as.numeric(1:10),
      h = c("cat1", "cat2", "cat3", "cat4", "cat5"))

    mean(dt$a)
    [1] 5.5

    dt[, mean(.SD), .SDcols = "a"]
    [1] NA
    Warning message:
    In mean.default(.SD) : argument is not numeric or logical: returning NA

    dt[, sum(.SD), .SDcols = "a"]
    [1] 55

    dt[, max(.SD), .SDcols = "a"]
    [1] 10

    dt[, colMeans(.SD), .SDcols = "a"]
      a
    5.5

    dt[, lapply(.SD, mean), .SDcols = "a"]
         a
    1: 5.5

Interestingly, weighted.mean gives the wrong answer (55, the sum) when I use weighted.mean(.SD) in j. But when I use lapply(.SD, weighted.mean) in j, it gives the correct answer (5.5, the average).
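My guess at where the 55 comes from, as a small sketch rather than a confirmed explanation: when no weights are given, weighted.mean's default method computes sum(x) / length(x), and for a one-column table length() counts columns, not rows. The sketch uses a plain one-column data.frame named sd_like as a stand-in for .SD; that stand-in and its name are my assumption, not something data.table provides.

```r
# Stand-in for .SD with .SDcols = "a": a one-column data.frame (assumption).
sd_like <- data.frame(a = 1:10)

total  <- sum(sd_like)     # adds up every cell in the table
n_cols <- length(sd_like)  # number of columns, not number of rows

# With no weights, weighted.mean's default method returns sum(x) / length(x).
wm_whole  <- weighted.mean(sd_like)    # whole table: 55 / 1
wm_column <- weighted.mean(sd_like$a)  # single vector: 55 / 10

print(c(total = total, n_cols = n_cols, whole = wm_whole, column = wm_column))
```

If this is right, the whole-table call divides by 1 and the per-column call divides by 10, which would match the 55 versus 5.5 I see.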

I tried disabling data.table's optimization to see whether its internal handling of mean was responsible, but that didn't change anything.
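For anyone who wants to repeat that check, this is roughly one way to do it, assuming the datatable.optimize option is the relevant switch (setting it to 0 turns the j optimizations off). The variable names are only for illustration.

```r
library(data.table)

dt <- data.table(a = 1:10)

# Switch data.table's j optimizations off, remembering the previous setting.
old <- options(datatable.optimize = 0L)
res_off <- suppressWarnings(dt[, mean(.SD), .SDcols = "a"])

# Restore the previous setting and run the same expression again.
options(old)
res_on <- suppressWarnings(dt[, mean(.SD), .SDcols = "a"])

print(c(optimize_off = res_off, optimize_on = res_on))
```

suppressWarnings() is there only to silence the "argument is not numeric or logical" warning shown earlier.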

Perhaps this is just a problem with calling mean() on a list (which seems to be what .SD is)? I guess there was never a reason NOT to use the lapply paradigm with .SD? It seems that only the lapply option returns a data.table. The rest seem to return vectors, with the exception of colMeans, which returns something else (a named vector? a list?).
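A quick way to test that guess is to ask what .SD actually is inside j. This is only a sketch on a cut-down table with the single column a; extracting the column with .SD[[1]] is one workaround I can think of, not necessarily the recommended idiom.

```r
library(data.table)

dt <- data.table(a = 1:10)

# Inspect .SD from inside j: it is a data.table, i.e. a list of columns.
sd_class   <- dt[, class(.SD), .SDcols = "a"]
sd_is_list <- dt[, is.list(.SD), .SDcols = "a"]
sd_is_num  <- dt[, is.numeric(.SD), .SDcols = "a"]

# Pulling the single column out of .SD hands mean() a plain vector.
m_extract <- dt[, mean(.SD[[1]]), .SDcols = "a"]

# lapply() visits each column of .SD and returns a one-row data.table.
m_lapply <- dt[, lapply(.SD, mean), .SDcols = "a"]

print(sd_class)
print(c(is_list = sd_is_list, is_numeric = sd_is_num))
print(m_extract)
print(m_lapply)
```

Since is.numeric(.SD) is FALSE, mean.default() takes its "not numeric or logical" branch and returns NA, which is consistent with the warning above.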

My main question is: why does mean(.SD) not work? And as a corollary, can .SD ever be used without one of the apply functions?

Thanks.
