Add two columns to dataframe based on column names

Example: I have a data frame

    > a = data.frame(T_a_1=c(1,2,3,4,5),T_a_2=c(2,3,4,5,6),T_b_1=c(3,4,5,6,7),T_c_1=c(4,5,6,7,8),length=c(1,2,3,4,5))
    > a
      T_a_1 T_a_2 T_b_1 T_c_1 length
    1     1     2     3     4      1
    2     2     3     4     5      2
    3     3     4     5     6      3
    4     4     5     6     7      4
    5     5     6     7     8      5

I want to add columns together (or perform some other operation, such as (column1 + column2) / length) based on their names. For example, T_a_1 and T_a_2 (the 1st and 2nd columns) share the common name T_a, so I want to add those two columns.

2 answers

I would use grep to match the column names against a pattern. Here are some examples:

    > a = data.frame(T_a_1=c(1,2,3,4,5),
    +                T_a_2=c(2,3,4,5,6),
    +                T_b_1=c(3,4,5,6,7),
    +                T_c_1=c(4,5,6,7,8),
    +                length=c(1,2,3,4,5))
    >
    > # display only columns that match T_a
    > a[,grep('T_a', colnames(a))]
      T_a_1 T_a_2
    1     1     2
    2     2     3
    3     3     4
    4     4     5
    5     5     6
    >
    > # sum
    > sum(a[,grep('T_a', colnames(a))])
    [1] 35
    >
    > # rowsum
    > rowSums(a[,grep('T_a', colnames(a))])
    [1]  3  5  7  9 11
    >
    > # your example: (column1 + column2) / length
    > rowSums(a[,grep('T_a', colnames(a))]) / a$length
    [1] 3.000000 2.500000 2.333333 2.250000 2.200000
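One caveat with the grep approach above: when the pattern matches only a single column, `a[, cols]` collapses to a plain vector and `rowSums` fails. Here is a minimal sketch of a more defensive variant, assuming you want the result stored as a new column. The column name `T_b` and the anchored pattern `^T_b_` are my own choices for illustration, not part of the question.

```r
a <- data.frame(T_a_1 = c(1, 2, 3, 4, 5),
                T_a_2 = c(2, 3, 4, 5, 6),
                T_b_1 = c(3, 4, 5, 6, 7),
                T_c_1 = c(4, 5, 6, 7, 8),
                length = c(1, 2, 3, 4, 5))

# Anchor the pattern with ^ so that only names starting with "T_b_" match.
cols <- grep("^T_b_", colnames(a))

# drop = FALSE keeps the selection a data frame even when only one
# column matches, which is what rowSums() needs.
# T_b is a hypothetical name for the new column.
a$T_b <- rowSums(a[, cols, drop = FALSE]) / a$length
```

The same two lines work unchanged for `^T_a_`, where two columns match.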

UPDATE:

From the comments below, I understand that you want to sum the matching columns, grouped by a common prefix, and divide each sum by the length column. The following code is an inelegant solution to the problem:

    > a = data.frame(ES51_223_1=c(1,2,3,4,5),
    +                ES51_312_1=c(2,3,4,5,6),
    +                ES52_223_2=c(3,4,5,6,7),
    +                ES52_312_2=c(4,5,6,7,8),
    +                ES53_223_3=c(1,2,3,4,5),
    +                length=c(1,2,3,4,5))
    >
    > # get the unique prefixes
    > prefixes = unique(unlist(lapply(colnames(subset(a, select=-length)), function(x) { strsplit(x, '_')[[1]][[1]]})))
    >
    > f = function(prefix) {
    +   return (rowSums(subset(a, select=grep(prefix, colnames(a)))) / a$length)
    + }
    > m = matrix(unlist(lapply(prefixes, f)), nrow=nrow(a))
    > colnames(m) = prefixes
    > m
             ES51     ES52 ES53
    [1,] 3.000000 7.000000    1
    [2,] 2.500000 4.500000    1
    [3,] 2.333333 3.666667    1
    [4,] 2.250000 3.250000    1
    [5,] 2.200000 3.000000    1

m is a matrix containing the results for different prefixes in different columns.
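If the inelegance bothers you, the same grouping can probably be done in fewer steps with `split.default`, which splits a data frame by columns rather than by rows. This is only a sketch under the same assumptions as above (the prefix is everything before the first underscore, and `length` is the divisor); the names `values`, `prefix`, `groups` and `m2` are made up for the example.

```r
a <- data.frame(ES51_223_1 = c(1, 2, 3, 4, 5),
                ES51_312_1 = c(2, 3, 4, 5, 6),
                ES52_223_2 = c(3, 4, 5, 6, 7),
                ES52_312_2 = c(4, 5, 6, 7, 8),
                ES53_223_3 = c(1, 2, 3, 4, 5),
                length = c(1, 2, 3, 4, 5))

# Everything except the divisor column.
values <- a[names(a) != "length"]

# Prefix = the part of each column name before the first underscore.
prefix <- sub("_.*$", "", names(values))

# split.default() treats the data frame as a list of columns, so this
# yields one small data frame per prefix.
groups <- split.default(values, prefix)

# Row sums per group, then divide every column by length
# (the vector is recycled down each column of the matrix).
m2 <- sapply(groups, rowSums) / a$length
```

The column names of `m2` come from the prefixes, so no separate `colnames<-` step is needed.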


How about this?

    # data
    df <- structure(list(ES51_223_1 = 1:5, ES51_312_1 = 2:6,
        ES52_223_2 = 3:7, ES52_312_2 = 4:8, ES53_223_3 = 1:5,
        length = 1:5),
        .Names = c("ES51_223_1", "ES51_312_1", "ES52_223_2",
                   "ES52_312_2", "ES53_223_3", "length"),
        row.names = c(NA, -5L), class = "data.frame")

    # create indices from factor levels (shortcut)
    ids <- gsub("_.*$", "", setdiff(names(df), "length"))
    ids <- factor(as.numeric(factor(ids)))
    > ids
    # [1] 1 1 2 2 3
    # Levels: 1 2 3

    # use the levels to fetch columns and sum them
    o <- sapply(as.numeric(levels(ids)), function(x) {
        rowSums(df[which(ids == x)])/df$length
    })
    > o
    #          [,1]     [,2] [,3]
    # [1,] 3.000000 7.000000    1
    # [2,] 2.500000 4.500000    1
    # [3,] 2.333333 3.666667    1
    # [4,] 2.250000 3.250000    1
    # [5,] 2.200000 3.000000    1
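Note that the columns of `o` are unlabeled, because the prefixes were converted to numeric levels. If you would rather keep the prefix names and attach the results to the original data frame, something along these lines should work. It is a sketch, not a tested drop-in; `value_cols`, `o_named` and `result` are hypothetical names.

```r
df <- data.frame(ES51_223_1 = 1:5, ES51_312_1 = 2:6,
                 ES52_223_2 = 3:7, ES52_312_2 = 4:8,
                 ES53_223_3 = 1:5, length = 1:5)

# Columns to aggregate, and the prefix of each one.
value_cols <- setdiff(names(df), "length")
ids <- gsub("_.*$", "", value_cols)

# Iterating over the prefix strings themselves lets sapply() use them
# as column names of the resulting matrix.
o_named <- sapply(unique(ids), function(p) {
  rowSums(df[value_cols[ids == p]]) / df$length
})

# Append the per-prefix results as new columns.
result <- cbind(df, o_named)
```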