How to apply the same function to each specified column in a data table.

I have a data table with which I would like to perform the same operation on specific columns. The names of these columns are indicated in the character vector. In this particular example, I would like to multiply all of these columns by -1.

Some toy data and a vector indicating the corresponding columns:

library(data.table) dt <- data.table(a = 1:3, b = 1:3, d = 1:3) cols <- c("a", "b") 

Now I do it like this, sorting through a character vector:

 for (col in 1:length(cols)) { dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))] } 

Is there a way to do this directly without a for loop?

+61
r data.table
May 30 '13 at 21:44
source share
6 answers

It works:

 dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols] 

Result

  abd 1: -1 -1 1 2: -2 -2 2 3: -3 -3 3 

There are a few tricks here:

  • Since (cols) := has parentheses, the result is assigned to the columns specified in cols , and not to some new variable called "cols".
  • .SDcols tells the call that we are considering only these columns and allows us to use the .SD , S ubset D ata associated with these columns.
  • lapply(.SD, ...) runs on .SD , which is a list of columns (e.g. all data.frames and data.tables). lapply returns a list, so at the end of j it looks like cols := list(...) .

EDIT . Here's another way, which is probably faster, as @Arun mentioned:

 for (j in cols) set(dt, j = j, value = -dt[[j]]) 
+111
May 30 '13 at 21:59
source share

I would like to add an answer if you want to also change the name of the columns. This is very convenient if you want to calculate the logarithm of several columns, which is often the case in empirical work.

 cols <- c("a", "b") out_cols = paste("log", cols, sep = ".") dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols] 
+9
Mar 30 '17 at 8:16
source share

UPDATE: The following is a great way to do this without a for loop

 dt[,(cols):= - dt[,..cols]] 

This is a convenient way to read code easily. But as far as performance is concerned, it remains with Frank's decision according to the microbenchmark result below.

 mbm = microbenchmark( base = for (col in 1:length(cols)) { dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))] }, franks_solution1 = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols], franks_solution2 = for (j in cols) set(dt, j = j, value = -dt[[j]]), hannes_solution = dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols], orhans_solution = for (j in cols) dt[,(j):= -1 * dt[, ..j]], orhans_solution2 = dt[,(cols):= - dt[,..cols]], times=1000 ) mbm Unit: microseconds expr min lq mean median uq max neval base_solution 3874.048 4184.4070 5205.8782 4452.5090 5127.586 69641.789 1000 franks_solution1 313.846 349.1285 448.4770 379.8970 447.384 5654.149 1000 franks_solution2 1500.306 1667.6910 2041.6134 1774.3580 1961.229 9723.070 1000 hannes_solution 326.154 405.5385 561.8263 495.1795 576.000 12432.400 1000 orhans_solution 3747.690 4008.8175 5029.8333 4299.4840 4933.739 35025.202 1000 orhans_solution2 752.000 831.5900 1061.6974 897.6405 1026.872 9913.018 1000 

as shown in the chart below

performance_comparison_chart

My previous answer: the following also works

 for (j in cols) dt[,(j):= -1 * dt[, ..j]] 
+6
Apr 02 '18 at 12:57
source share

It seems that none of the solutions listed above works with group calculation. The following is the best I got:

 for(col in cols) { DT[, (col) := scale(.SD[[col]], center = TRUE, scale = TRUE), g] } 
+1
Nov 19 '18 at 18:43
source share
 library(data.table) (dt <- data.table(a = 1:3, b = 1:3, d = 1:3)) Hence: abd 1: 1 1 1 2: 2 2 2 3: 3 3 3 Whereas (dt*(-1)) yields: abd 1: -1 -1 -1 2: -2 -2 -2 3: -3 -3 -3 
+1
Jan 23 '19 at 16:12
source share

Add an example to create new columns based on a string vector of columns. Based on Jfly's answer:

 dt <- data.table(a = rnorm(1:100), b = rnorm(1:100), c = rnorm(1:100), g = c(rep(1:10, 10))) col0 <- c("a", "b", "c") col1 <- paste0("max.", col0) for(i in seq_along(col0)) { dt[, (col1[i]) := max(get(col0[i])), g] } dt[,.N, c("g", col1)] 
0
04 Feb '19 at 10:29
source share



All Articles