Update multiple data.table columns

I am trying to do something simple: divide 40 columns of a data.table by their mean. I cannot provide the actual data (not all columns are numeric, and I have more than 8M rows), but here is an example:

 library(data.table)
 dt <- data.table(matrix(sample(1:100,4000,T),ncol=40))
 colmeans <- colMeans(dt)

Next, I thought I would do:

 for (col in names(colmeans)) dt[,col:=dt[,col]/colmeans[col]] 

But this returns an error, since dt[,col] requires the column name to be unquoted. Using as.name(col) does not cut it either. Now,

 res <- t(t(dt[,1:40,with=F])/colmeans)

contains the expected result, but I cannot insert it back into the data.table.

 dt[,1:40] <- res 

does not work, and neither does dt[,1:40:=res, with=F].

The following works, but I find it pretty ugly:

 for (i in seq_along(colmeans)) dt[,i:=dt[,i,with=F]/colmeans[i],with=F] 

Of course, I could also create a new data.table by calling data.table() on res and the other non-numeric columns that my data.table has, but isn't there something more efficient?

+8
r data.table
4 answers

What about

 dt[, (names(dt)) := lapply(.SD, function(x) x/mean(x))] 

If you need to specify specific columns, you can use

 dt[, 1:40 := lapply(.SD, function(x) x/mean(x)), .SDcols = 1:40] 

or

 cols <- names(dt)[c(1,5,10)]
 dt[, (cols) := lapply(.SD, function(x) x/mean(x)), .SDcols = cols]
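
Since the question mentions that not all columns are numeric, the same pattern can be pointed at the numeric columns only. The following is a minimal sketch under assumptions: the example data and the names `a`, `b`, `id`, and `num_cols` are hypothetical and do not come from the question.

```r
library(data.table)

# Hypothetical example data: two numeric columns and one character column
set.seed(1)
dt <- data.table(a  = sample(1:100, 10, TRUE),
                 b  = runif(10),
                 id = letters[1:10])

# Names of the numeric columns only (integer and double both count)
num_cols <- names(dt)[vapply(dt, is.numeric, logical(1))]

# Divide each numeric column by its own mean, updating dt by reference
dt[, (num_cols) := lapply(.SD, function(x) x / mean(x)), .SDcols = num_cols]
```

After this, every column listed in `num_cols` should have a mean of 1, and `id` is left untouched.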
+20

We can also use set(). In this case there should be no noticeable difference from using [.data.table with :=, but in scenarios where [.data.table would have to be called many times, set() avoids that overhead and can be noticeably faster.

 for(j in names(dt)) {
   set(dt, i = NULL, j = j, value = dt[[j]]/mean(dt[[j]]))
 }

It can also be applied to selected columns only, e.g.

 nm1 <- names(dt)[1:5]
 for(j in nm1) {
   set(dt, i = NULL, j = j, value = dt[[j]]/mean(dt[[j]]))
 }

data

 set.seed(24)
 dt <- as.data.frame(matrix(sample(1:100,4000,TRUE),ncol=40))
 setDT(dt)
+3

dplyr 0.4.3

To divide all columns by their mean, you can do:

 dplyr::mutate_each(dt, funs(. / mean(.))) 

Or specify column positions:

 dplyr::mutate_each(dt, funs(. / mean(.)), 5:10) 

Or column names:

 dplyr::mutate_each_(dt, funs(. / mean(.)), colnames(dt)[5:10]) 

dplyr 0.4.3.9000

If you want to divide only the numeric columns, the devel version of dplyr has mutate_if, which operates on the columns for which the predicate returns TRUE:

 dplyr::mutate_if(dt, is.numeric, funs(. / mean(.))) 
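
The calls above are tied to the dplyr versions named in this answer. In later releases (assuming dplyr 1.0.0 or newer), the same operation is usually written with `across()`. A minimal sketch, where the data frame `df` and its columns are hypothetical:

```r
library(dplyr)

# Hypothetical example data with one non-numeric column
df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30), g = c("a", "b", "c"))

# Divide every numeric column by its mean; other columns are left untouched
res <- mutate(df, across(where(is.numeric), ~ .x / mean(.x)))
```

Unlike `:=` and `set()`, this returns a new object rather than updating the input by reference, so the result has to be assigned.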
+3

How about some melt and dcast magic? This converts the data to the "long" format, and then back to the original "wide" format.

First, melt the data on an ID variable:

 # make an ID variable
 dt[, idvar := 1:nrow(dt)]
 # melt the data on the ID variable
 dt2 <- melt(dt, "idvar")

Then divide each value by the mean of its group:

 # use data.table by = to do a fast division by group mean
 dt2[, divByMean := value / mean(value), by = variable]
 dt2
 ##       idvar variable value divByMean
 ##    1:     1       V1    15 0.2859867
 ##    2:     2       V1    92 1.7540515
 ##    3:     3       V1    27 0.5147760
 ##    4:     4       V1     7 0.1334604
 ##    5:     5       V1    18 0.3431840
 ##   ---
 ## 3996:    96      V40    54 1.1111111
 ## 3997:    97      V40    51 1.0493827
 ## 3998:    98      V40    23 0.4732510
 ## 3999:    99      V40     8 0.1646091
 ## 4000:   100      V40    11 0.2263374

Then go back to the original wide format:

 # now dcast back to "wide"
 dt3 <- dcast(dt2, idvar ~ variable, mean, value.var = "divByMean")
 dt3[1:5, 1:5]
 ##   idvar        V1        V2        V3        V4
 ## 1     1 0.2859867 0.6913303 0.2110919 1.6156624
 ## 2     2 1.7540515 0.7847534 0.5948954 1.8817715
 ## 3     3 0.5147760 0.2615845 0.8827480 0.4181715
 ## 5     5 0.3431840 0.3550075 0.3646133 0.3231325
 ## 4     4 0.1334604 1.7937220 1.3241220 1.3685611
+1