R, data.table or dplyr, decoding with long colnames formats

Question

R, data.table or dplyr, decoding with long colnames formats

Imagine that I have a dataframe with column names like Mary1, Mary2, Mary3, Bob1, Bob2, Bob3, Pam1, Pam2, Pam3 etc., but with many columns.

Here is a simpler reproducible example.

set.seed(1) mydata <- data.frame() mydata <- rbind(mydata,c(1,round(runif(30),3))) mydata <- rbind(mydata,c(2,round(runif(30),3))) mydata <- rbind(mydata,c(3,round(runif(30),3))) colnames(mydata) <- c("id", paste0(rep(LETTERS[1:10], each=3), 1:3))

which gives:

 id A1 A2 A3 B1 B2 B3 C1 C2 C3 D1 D2 D3 E1 E2 E3 F1 F2 F3 G1 G2 G3 H1 H2 H3 I1 I2 I3 J1 J2 J3 ... 1 0.266 0.372 0.573 0.908 0.202 0.898 0.945 0.661 0.629 0.062 0.206 0.177 0.687 0.384 0.770 0.498 0.718 0.992 0.380 0.777 0.935 0.212 0.652 0.126 0.267 0.386 0.013 0.382 0.870 0.340 ... 2 0.482 0.600 0.494 0.186 0.827 0.668 0.794 0.108 0.724 0.411 0.821 0.647 0.783 0.553 0.530 0.789 0.023 0.477 0.732 0.693 0.478 0.861 0.438 0.245 0.071 0.099 0.316 0.519 0.662 0.407 ... 3 0.913 0.294 0.459 0.332 0.651 0.258 0.479 0.766 0.084 0.875 0.339 0.839 0.347 0.334 0.476 0.892 0.864 0.390 0.777 0.961 0.435 0.713 0.400 0.325 0.757 0.203 0.711 0.122 0.245 0.143 ...

I want to get the format of a long table, for example:

 set.seed(1) mydata <- data.frame() mydata <- rbind(mydata,c(1,1,round(runif(10),3))) mydata <- rbind(mydata,c(1,2,round(runif(10),3))) mydata <- rbind(mydata,c(1,3,round(runif(10),3))) mydata <- rbind(mydata,c(2,1,round(runif(10),3))) mydata <- rbind(mydata,c(2,2,round(runif(10),3))) mydata <- rbind(mydata,c(2,3,round(runif(10),3))) colnames(mydata) <- c("id","N", LETTERS[1:10])

this is:

  id NABCDEFGHIJ 1 1 0.266 0.372 0.573 0.908 0.202 0.898 0.945 0.661 0.629 0.062 1 2 0.206 0.177 0.687 0.384 0.770 0.498 0.718 0.992 0.380 0.777 1 3 0.482 0.600 0.494 0.186 0.827 0.668 0.794 0.108 0.724 0.411 2 1 0.935 0.212 0.652 0.126 0.267 0.386 0.013 0.382 0.870 0.340 2 2 0.821 0.647 0.783 0.553 0.530 0.789 0.023 0.477 0.732 0.693 2 3 0.478 0.861 0.438 0.245 0.071 0.099 0.316 0.519 0.662 0.407

How can I get it with data.table or dplyr / tidyr? or any other simple option.

If i try

 melt(mydata, id=1)

the result is a single column.

I checked the official help and vignettes, but I can only find simpler examples, with a small number of columns, the user specifies each of them manually and one example template (), but I can’t adapt it to my example.

Other threads use gsub, but this is confusing for me.

What I really want to do is a little more complicated, but I think this is the first step (I will repeat it later). Imagine that my columns are Mary1, Mary2, Bob1, Bob2, Pam1, Pam2 ... I want to create new columns with the differences from the two above: Mary1-Mary2, Bob1-Bob2, Pam1-Pam2 ...

Summing up: I do not want to write the name of all columns manually, but automatically delete them, deleting the last digit.

PD: Okay, I am updating my question. It should work not only for names such as A1, A2 ... but also for longer names such as

 colnames(mydata) <- c("id", paste0(rep(LETTERS[1:10], each=3), rep(LETTERS[1:10], each=3), 1:3))

I am not against speed, I look at something simple, not mysterious.

+1

r reshape

skan May 03 '16 at 18:41

source share

3 answers

Arun · Answer 1 · 2016-05-03T18:58:58+0000

Using data.table::melt :

 require(data.table) n = unique(gsub("[0-9]+$", "", names(mydata)[-1L])) p = paste0("^", n) melt(setDT(mydata), measure=patterns(p), value.name=n, variable.name="N") # id NABCDEFGHIJ # 1: 1 1 0.266 0.908 0.945 0.062 0.687 0.498 0.380 0.212 0.267 0.382 # 2: 2 1 0.482 0.186 0.794 0.411 0.783 0.789 0.732 0.861 0.071 0.519 # 3: 3 1 0.913 0.332 0.479 0.875 0.347 0.892 0.777 0.713 0.757 0.122 # 4: 1 2 0.372 0.202 0.661 0.206 0.384 0.718 0.777 0.652 0.386 0.870 # 5: 2 2 0.600 0.827 0.108 0.821 0.553 0.023 0.693 0.438 0.099 0.662 # 6: 3 2 0.294 0.651 0.766 0.339 0.334 0.864 0.961 0.400 0.203 0.245 # 7: 1 3 0.573 0.898 0.629 0.177 0.770 0.992 0.935 0.126 0.013 0.340 # 8: 2 3 0.494 0.668 0.724 0.647 0.530 0.477 0.478 0.245 0.316 0.407 # 9: 3 3 0.459 0.258 0.084 0.839 0.476 0.390 0.435 0.325 0.711 0.143

rawr · Answer 2 · 2016-05-03T19:00:04+0000

 set.seed(1) mydata <- data.frame() mydata <- rbind(mydata,c(1,round(runif(30),3))) mydata <- rbind(mydata,c(2,round(runif(30),3))) mydata <- rbind(mydata,c(3,round(runif(30),3))) colnames(mydata) <- c("id", paste0(rep(LETTERS[1:10], each=3), 1:3)) reshape(mydata, dir = 'long', varying = names(mydata)[-1], sep = '', timevar = 'N') # id NABCDEFGHIJ # 1.1 1 1 0.266 0.908 0.945 0.062 0.687 0.498 0.380 0.212 0.267 0.382 # 2.1 2 1 0.482 0.186 0.794 0.411 0.783 0.789 0.732 0.861 0.071 0.519 # 3.1 3 1 0.913 0.332 0.479 0.875 0.347 0.892 0.777 0.713 0.757 0.122 # 1.2 1 2 0.372 0.202 0.661 0.206 0.384 0.718 0.777 0.652 0.386 0.870 # 2.2 2 2 0.600 0.827 0.108 0.821 0.553 0.023 0.693 0.438 0.099 0.662 # 3.2 3 2 0.294 0.651 0.766 0.339 0.334 0.864 0.961 0.400 0.203 0.245 # 1.3 1 3 0.573 0.898 0.629 0.177 0.770 0.992 0.935 0.126 0.013 0.340 # 2.3 2 3 0.494 0.668 0.724 0.647 0.530 0.477 0.478 0.245 0.316 0.407 # 3.3 3 3 0.459 0.258 0.084 0.839 0.476 0.390 0.435 0.325 0.711 0.143

Gopala · Answer 3 · 2016-05-03T18:54:53+0000

Here is one solution with tidyr :

 library(tidyr) mydata %>% gather(key, value, -id) %>% separate(key, into = c('key1', 'key2'), sep = '(?<=[a-zA-Z])(?=[0-9])') %>% spread(key1, value)

Result:

  id key2 ABCDEFGHIJ 1 1 1 0.266 0.908 0.945 0.062 0.687 0.498 0.380 0.212 0.267 0.382 2 1 2 0.372 0.202 0.661 0.206 0.384 0.718 0.777 0.652 0.386 0.870 3 1 3 0.573 0.898 0.629 0.177 0.770 0.992 0.935 0.126 0.013 0.340 4 2 1 0.482 0.186 0.794 0.411 0.783 0.789 0.732 0.861 0.071 0.519 5 2 2 0.600 0.827 0.108 0.821 0.553 0.023 0.693 0.438 0.099 0.662 6 2 3 0.494 0.668 0.724 0.647 0.530 0.477 0.478 0.245 0.316 0.407 7 3 1 0.913 0.332 0.479 0.875 0.347 0.892 0.777 0.713 0.757 0.122 8 3 2 0.294 0.651 0.766 0.339 0.334 0.864 0.961 0.400 0.203 0.245 9 3 3 0.459 0.258 0.084 0.839 0.476 0.390 0.435 0.325 0.711 0.143

R, data.table or dplyr, decoding with long colnames formats

More articles: