Collect multiple sets of columns

Question

Collect multiple sets of columns

I have data from an online survey where respondents go through a series of questions 1-3 times. The survey software (Qualtrics) writes this data into multiple columns, i.e. Q3.2 in the survey will have Q3.2.1. columns Q3.2.1. Q3.2.2. and Q3.2.3. :

 df <- data.frame( id = 1:10, time = as.Date('2009-01-01') + 0:9, Q3.2.1. = rnorm(10, 0, 1), Q3.2.2. = rnorm(10, 0, 1), Q3.2.3. = rnorm(10, 0, 1), Q3.3.1. = rnorm(10, 0, 1), Q3.3.2. = rnorm(10, 0, 1), Q3.3.3. = rnorm(10, 0, 1) ) # Sample data id time Q3.2.1. Q3.2.2. Q3.2.3. Q3.3.1. Q3.3.2. Q3.3.3. 1 1 2009-01-01 -0.2059165 -0.29177677 -0.7107192 1.52718069 -0.4484351 -1.21550600 2 2 2009-01-02 -0.1981136 -1.19813815 1.1750200 -0.40380049 -1.8376094 1.03588482 3 3 2009-01-03 0.3514795 -0.27425539 1.1171712 -1.02641801 -2.0646661 -0.35353058 ...

I want to combine all the columns of QN.N * into neat separate columns of QN.N, eventually getting something like this:

  id time loop_number Q3.2 Q3.3 1 1 2009-01-01 1 -0.20591649 1.52718069 2 2 2009-01-02 1 -0.19811357 -0.40380049 3 3 2009-01-03 1 0.35147949 -1.02641801 ... 11 1 2009-01-01 2 -0.29177677 -0.4484351 12 2 2009-01-02 2 -1.19813815 -1.8376094 13 3 2009-01-03 2 -0.27425539 -2.0646661 ... 21 1 2009-01-01 3 -0.71071921 -1.21550600 22 2 2009-01-02 3 1.17501999 1.03588482 23 3 2009-01-03 3 1.11717121 -0.35353058 ...

The tidyr library has a gather() function, which is great for combining one set of columns:

 library(dplyr) library(tidyr) library(stringr) df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>% mutate(loop_number = str_sub(loop_number,-2,-2)) %>% select(id, time, loop_number, Q3.2) id time loop_number Q3.2 1 1 2009-01-01 1 -0.20591649 2 2 2009-01-02 1 -0.19811357 3 3 2009-01-03 1 0.35147949 ... 29 9 2009-01-09 3 -0.58581232 30 10 2009-01-10 3 -2.33393981

The resulting data frame has 30 rows, as expected (10 individuals, 3 loops each). However, collecting the second set of columns does not work correctly - it successfully makes two combined columns Q3.2 and Q3.3 , but ends in 90 rows instead of 30 (all combinations of 10 people, 3 cycles of Q3.2, and 3 cycles of Q3.3, the combinations will be increase significantly for each group of columns in the actual data):

 df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>% gather(loop_number, Q3.3, starts_with("Q3.3")) %>% mutate(loop_number = str_sub(loop_number,-2,-2)) id time loop_number Q3.2 Q3.3 1 1 2009-01-01 1 -0.20591649 1.52718069 2 2 2009-01-02 1 -0.19811357 -0.40380049 3 3 2009-01-03 1 0.35147949 -1.02641801 ... 89 9 2009-01-09 3 -0.58581232 -0.13187024 90 10 2009-01-10 3 -2.33393981 -0.48502131

Is there a way to use multiple calls to gather() like this, combining small subsets of columns like this, while maintaining the correct number of rows?

+94

r dplyr tidyr reshape qualtrics

Andrew Sep 19 '14 at 2:41

source share

5 answers

This can be done using reshape . It is possible with dplyr though.

  colnames(df) <- gsub("\\.(.{2})$", "_\\1", colnames(df)) colnames(df)[2] <- "Date" res <- reshape(df, idvar=c("id", "Date"), varying=3:8, direction="long", sep="_") row.names(res) <- 1:nrow(res) head(res) # id Date time Q3.2 Q3.3 #1 1 2009-01-01 1 1.3709584 0.4554501 #2 2 2009-01-02 1 -0.5646982 0.7048373 #3 3 2009-01-03 1 0.3631284 1.0351035 #4 4 2009-01-04 1 0.6328626 -0.6089264 #5 5 2009-01-05 1 0.4042683 0.5049551 #6 6 2009-01-06 1 -0.1061245 -1.7170087

Or using dplyr

  library(tidyr) library(dplyr) colnames(df) <- gsub("\\.(.{2})$", "_\\1", colnames(df)) df %>% gather(loop_number, "Q3", starts_with("Q3")) %>% separate(loop_number,c("L1", "L2"), sep="_") %>% spread(L1, Q3) %>% select(-L2) %>% head() # id time Q3.2 Q3.3 #1 1 2009-01-01 1.3709584 0.4554501 #2 1 2009-01-01 1.3048697 0.2059986 #3 1 2009-01-01 -0.3066386 0.3219253 #4 2 2009-01-02 -0.5646982 0.7048373 #5 2 2009-01-02 2.2866454 -0.3610573 #6 2 2009-01-02 -1.7813084 -0.7838389

Refresh

With tidyr_0.8.3.9000 we can use pivot_longer to pivot_longer multiple columns. (Using the changed column names from gsub above)

 library(dplyr) library(tidyr) df %>% pivot_longer(cols = starts_with("Q3"), names_to = c(".value", "Q3"), names_sep = "_") %>% select(-Q3) # A tibble: 30 x 4 # id time Q3.2 Q3.3 # <int> <date> <dbl> <dbl> # 1 1 2009-01-01 0.974 1.47 # 2 1 2009-01-01 -0.849 -0.513 # 3 1 2009-01-01 0.894 0.0442 # 4 2 2009-01-02 2.04 -0.553 # 5 2 2009-01-02 0.694 0.0972 # 6 2 2009-01-02 -1.11 1.85 # 7 3 2009-01-03 0.413 0.733 # 8 3 2009-01-03 -0.896 -0.271 #9 3 2009-01-03 0.509 -0.0512 #10 4 2009-01-04 1.81 0.668 # … with 20 more rows

NOTE. The values are different because there was no predefined seed when creating the input dataset.

+29

akrun Sep 19 '14 at 4:07

source share

With a recent upgrade to melt.data.table we can now melt multiple columns. With this we can:

 require(data.table) ## 1.9.5 melt(setDT(df), id=1:2, measure=patterns("^Q3.2", "^Q3.3"), value.name=c("Q3.2", "Q3.3"), variable.name="loop_number") # id time loop_number Q3.2 Q3.3 # 1: 1 2009-01-01 1 -0.433978480 0.41227209 # 2: 2 2009-01-02 1 -0.567995351 0.30701144 # 3: 3 2009-01-03 1 -0.092041353 -0.96024077 # 4: 4 2009-01-04 1 1.137433487 0.60603396 # 5: 5 2009-01-05 1 -1.071498263 -0.01655584 # 6: 6 2009-01-06 1 -0.048376809 0.55889996 # 7: 7 2009-01-07 1 -0.007312176 0.69872938

You can get the development version here .

+18

Arun Feb 28 '15 at 8:12

source share

This has nothing to do with tidyr and dplyr, but here is another option: merged.stack from my splitstackshape package , V1. 4.0 and higher.

 library(splitstackshape) merged.stack(df, id.vars = c("id", "time"), var.stubs = c("Q3.2.", "Q3.3."), sep = "var.stubs") # id time .time_1 Q3.2. Q3.3. # 1: 1 2009-01-01 1. -0.62645381 1.35867955 # 2: 1 2009-01-01 2. 1.51178117 -0.16452360 # 3: 1 2009-01-01 3. 0.91897737 0.39810588 # 4: 2 2009-01-02 1. 0.18364332 -0.10278773 # 5: 2 2009-01-02 2. 0.38984324 -0.25336168 # 6: 2 2009-01-02 3. 0.78213630 -0.61202639 # 7: 3 2009-01-03 1. -0.83562861 0.38767161 # <<:::SNIP:::>> # 24: 8 2009-01-08 3. -1.47075238 -1.04413463 # 25: 9 2009-01-09 1. 0.57578135 1.10002537 # 26: 9 2009-01-09 2. 0.82122120 -0.11234621 # 27: 9 2009-01-09 3. -0.47815006 0.56971963 # 28: 10 2009-01-10 1. -0.30538839 0.76317575 # 29: 10 2009-01-10 2. 0.59390132 0.88110773 # 30: 10 2009-01-10 3. 0.41794156 -0.13505460 # id time .time_1 Q3.2. Q3.3.

+10

A5C1D2H2I1M1N2O1R2T1 19 Sept. '14 at 6:43

source share

If you are like me and cannot decide how to use the "regular expression with capture groups" for extract , the following code will replicate the extract(...) in the Hadleys answer:

 df %>% gather(question_number, value, starts_with("Q3.")) %>% mutate(loop_number = str_sub(question_number,-2,-2), question_number = str_sub(question_number,1,4)) %>% select(id, time, loop_number, question_number, value) %>% spread(key = question_number, value = value)

The problem is that the initial assembly forms a key column, which is actually a combination of two keys. I decided to use mutate in my original solution in the comments to split this column into two columns with equivalent information, the loop_number column and the question_number column. spread can then be used to transform long form data, which is a pair of key values (question_number, value) for wide format data.

+6

Alex Jun 28 '16 at 6:03

source share

hadley · Accepted Answer · 2014-09-19 10:44

This approach seems pretty natural to me:

 df %>% gather(key, value, -id, -time) %>% extract(key, c("question", "loop_number"), "(Q.\\..)\\.(.)") %>% spread(question, value)

First, collect all the columns of the questions, use extract() to separate the question and loop_number , then spread() question back into the columns.

 #> id time loop_number Q3.2 Q3.3 #> 1 1 2009-01-01 1 0.142259203 -0.35842736 #> 2 1 2009-01-01 2 0.061034802 0.79354061 #> 3 1 2009-01-01 3 -0.525686204 -0.67456611 #> 4 2 2009-01-02 1 -1.044461185 -1.19662936 #> 5 2 2009-01-02 2 0.393808163 0.42384717

Collect multiple sets of columns

Refresh

More articles: