Update multiple data.table columns

Question

I am trying to do something simple: divide 40 columns of a data.table by their mean. I cannot provide the actual data (not all columns are numeric, and I have more than 8M rows), but here is an example:

 library(data.table)
 dt <- data.table(matrix(sample(1:100,4000,T),ncol=40))
 colmeans <- colMeans(dt)

Next, I thought I would do:

 for (col in names(colmeans)) dt[,col:=dt[,col]/colmeans[col]]

But this returns an error, since dt[,col] expects an unquoted column name rather than a character string. Using as.name(col) does not cut it. Now,

 res <- t(t(dt[,1:40,with=F]/colmeans))

contains the desired result, but I cannot insert it back into the data table.

 dt[,1:40] <- res

does not work, and neither does dt[,1:40:=res, with=F].

The following works, but I find it pretty ugly:

 for (i in seq_along(colmeans)) dt[,i:=dt[,i,with=F]/colmeans[i],with=F]

Of course, I could also create a new data table by calling data.table() on res and the non-numeric columns of my data.table, but isn't there something more efficient?

+8

r data.table

jeanlain Jun 09 '16 at 8:36

4 answers

We can also use set(). In this case there should be no noticeable difference compared with using [.data.table along with :=, but in scenarios where [.data.table needs to be called many times, set() avoids that overhead and can be noticeably faster.

 for(j in names(dt)) {
   set(dt, i=NULL, j = j, value = dt[[j]]/mean(dt[[j]]))
 }

It can also be applied to selected columns only, e.g.

 nm1 <- names(dt)[1:5]
 for(j in nm1){
   set(dt, i = NULL, j = j, value = dt[[j]]/mean(dt[[j]]))
 }

data

 set.seed(24)
 dt <- as.data.frame(matrix(sample(1:100,4000,TRUE),ncol=40))
 setDT(dt)
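Since the question mentions that not all columns are numeric, here is a minimal sketch of the same set() loop restricted to numeric columns. The table dt_mixed and its column names are made up for illustration; they are not the asker's data.

```r
library(data.table)

# Hypothetical example data: two numeric columns plus one character column
set.seed(24)
dt_mixed <- data.table(a = sample(1:100, 10, TRUE),
                       b = sample(1:100, 10, TRUE),
                       label = letters[1:10])

# Names of the numeric columns only
num_cols <- names(dt_mixed)[sapply(dt_mixed, is.numeric)]

# Divide each numeric column by its own mean, updating dt_mixed by reference
for (j in num_cols) {
  set(dt_mixed, i = NULL, j = j, value = dt_mixed[[j]] / mean(dt_mixed[[j]]))
}
```

After the loop, each numeric column should have a mean of 1 and the label column is left untouched.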

+3

akrun Jun 09 '16 at 8:45

dplyr 0.4.3

To divide all columns by their average value, you can do:

 dplyr::mutate_each(dt, funs(. / mean(.)))

Or specify column positions:

 dplyr::mutate_each(dt, funs(. / mean(.)), 5:10)

Or column names:

 dplyr::mutate_each_(dt, funs(. / mean(.)), colnames(dt)[5:10])

dplyr 0.4.3.9000

If you want to divide only the numeric columns, the devel version of dplyr has mutate_if, which operates on the columns for which the predicate returns TRUE:

 dplyr::mutate_if(dt, is.numeric, funs(. / mean(.)))
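In later dplyr releases (1.0.0 and up), mutate_each() and funs() are deprecated and across() covers the same ground. A rough equivalent, assuming a recent dplyr and a made-up data frame df, might look like this:

```r
library(dplyr)

# Hypothetical example data: two numeric columns plus one character column
set.seed(24)
df <- data.frame(a = sample(1:100, 10, TRUE),
                 b = sample(1:100, 10, TRUE),
                 label = letters[1:10])

# Divide every numeric column by its mean; non-numeric columns pass through
res <- mutate(df, across(where(is.numeric), ~ .x / mean(.x)))
```

Note that dplyr returns a modified copy rather than updating the input by reference, which may matter with 8M rows.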

+3

Steven Beaupré Jun 09 '16 at 10:01

How about some melt and dcast magic? This converts the data to "long" format and then back to the original "wide" format.

First, melt the data on an ID variable:

 # make an ID variable
 dt[, idvar := 1:nrow(dt)]
 # melt the data on the ID variable
 dt2 <- melt(dt, "idvar")

Then we divide each value by the mean of its group:

 # use data.table by = to do a fast division by group mean
 dt2[, divByMean := value / mean(value), by = variable]
 dt2
 ##       idvar variable value divByMean
 ##    1:     1       V1    15 0.2859867
 ##    2:     2       V1    92 1.7540515
 ##    3:     3       V1    27 0.5147760
 ##    4:     4       V1     7 0.1334604
 ##    5:     5       V1    18 0.3431840
 ##   ---
 ## 3996:    96      V40    54 1.1111111
 ## 3997:    97      V40    51 1.0493827
 ## 3998:    98      V40    23 0.4732510
 ## 3999:    99      V40     8 0.1646091
 ## 4000:   100      V40    11 0.2263374

Then go back to the original wide format:

 # now dcast back to "wide"
 dt3 <- dcast(dt2, idvar ~ variable, mean, value.var = "divByMean")
 dt3[1:5, 1:5]
 ##   idvar        V1        V2        V3        V4
 ## 1     1 0.2859867 0.6913303 0.2110919 1.6156624
 ## 2     2 1.7540515 0.7847534 0.5948954 1.8817715
 ## 3     3 0.5147760 0.2615845 0.8827480 0.4181715
 ## 5     5 0.3431840 0.3550075 0.3646133 0.3231325
 ## 4     4 0.1334604 1.7937220 1.3241220 1.3685611

+1

Ken Benoit Jun 09 '16 at 9:00

docendo discimus · Accepted Answer · 2016-06-09T08:43:49+0000

What about

 dt[, (names(dt)) := lapply(.SD, function(x) x/mean(x))]

If you need to specify specific columns, you can use

 dt[, 1:40 := lapply(.SD, function(x) x/mean(x)), .SDcols = 1:40]

or

 cols <- names(dt)[c(1,5,10)]
 dt[, (cols) := lapply(.SD, function(x) x/mean(x)), .SDcols = cols]
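If the means are already computed, as with colmeans in the question, one possible variation is to pair each column with its mean using Map(). This is only a sketch: dt_mixed, num_cols and col_means are hypothetical names and data, not part of the original post.

```r
library(data.table)

# Hypothetical example data: two numeric columns plus one character column
set.seed(24)
dt_mixed <- data.table(a = sample(1:100, 10, TRUE),
                       b = sample(1:100, 10, TRUE),
                       label = letters[1:10])

# Pick the numeric columns and precompute their means
num_cols  <- names(dt_mixed)[sapply(dt_mixed, is.numeric)]
col_means <- colMeans(dt_mixed[, num_cols, with = FALSE])

# Divide each selected column by its precomputed mean, by reference
dt_mixed[, (num_cols) := Map(`/`, .SD, col_means[num_cols]), .SDcols = num_cols]
```

Indexing col_means by num_cols keeps the means in the same order as the columns in .SD, so each column is divided by its own mean.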
