I am looking for some tips on restructuring data. I collect data using Google Forms, which is downloaded as a CSV file and looks something like this:
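#         alpha         beta option
# 1           6 8, 9, 10, 11  apple
# 2           9            6   pear
# 3           1            6  apple
# 4           3         8, 9   pear
# 5           3         6, 8   lime
# 6           3            1  apple
# 7 2, 4, 7, 11            9   lime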
The data has two variables (alpha and beta), each listing numbers. For most of my data, there is only one number in each variable. However, some observations have two, three, or even up to ten numbers. This is because these answers were collected using the "checkbox" option in Google Forms, which allows multiple answers to one survey question. It may also matter for some potential solutions that Google Forms returns a leading space in front of each answer after the first.
In my real data, this happens in only a very small proportion of all observations; the example above is just more concise. There are several other variables in the dataset. Here, I simply include one called option, which is a factor.
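To make the leading-space issue concrete, here is a minimal sketch of splitting one checkbox answer into clean numbers. The string `answer` is a made-up value in the same shape as the data above, and `parts` and `values` are illustrative names only.

```r
# Hypothetical checkbox answer as it might appear in the CSV
answer <- "2, 4, 7, 11"

# Split on commas, then strip the surrounding whitespace from each piece
parts <- trimws(strsplit(answer, ",", fixed = TRUE)[[1]])

# Convert the cleaned pieces to numbers
values <- as.numeric(parts)
```

Without the `trimws()` step, the pieces after the first would keep their leading space, which matters if they are later compared as strings or turned into factor levels.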
What I need to do is duplicate every observation that contains multiple numbers in the alpha or beta variable. The number of duplicated rows must equal the number of numbers in the alpha or beta variable. Then I need to replace the sequence of numbers in the alpha or beta variable with each individual number. This would result in the following:
# alpha beta option
Here is the code to reproduce the original example data shown above. I called the data frame demo:
demo<-structure(list(alpha = structure(c(4L, 5L, 1L, 3L, 3L, 3L, 2L), .Label = c("1","2, 4, 7, 11", "3", "6", "9"), class = "factor"), beta = structure(c(5L, 2L, 2L, 4L, 3L, 1L, 6L), .Label = c("1", "6", "6, 8", "8, 9", "8, 9, 10, 11", "9"), class = "factor"), option = structure(c(1L, 3L, 1L, 3L, 2L, 1L, 2L), .Label = c("apple", "lime", "pear"), class = "factor")), .Names = c("alpha", "beta", "option"), class = "data.frame", row.names = c(NA, -7L))
OK, so I think I have written code that, in a rather clunky way, produces the new data frame I'm looking for. However, it seems like there must be a more elegant and better way to do this.
Basically, I work on the alpha variable first. I select observations based on whether the variable contains commas or not. For the observations that contain commas, I use strsplit to separate the numbers. Then I count how many numbers exist for each observation and duplicate each observation that many times. Then I melt the split numbers into a data frame with all the numbers in a variable called "value", and simply replace the alpha variable with the data in the melted value variable. Then I rbind the result with the observations that do not contain commas. Finally, I take this data frame and repeat the process on the beta variable.
Here is my solution (it seems to work):
library(reshape2)

# Step 1: expand the rows whose alpha holds several comma-separated values
demo$a <- grepl(",", demo$alpha)
demo.atrue <- demo[demo$a, ]
demo.afalse <- demo[!demo$a, ]
demo.atrue$alpha <- as.character(demo.atrue$alpha)
temp <- strsplit(demo.atrue$alpha, ",")
temp.lengths <- sapply(temp, length)
# Repeat each row once per value (the for loop here was redundant:
# it redid this same assignment on every pass)
df.expanded <- demo.atrue[rep(row.names(demo.atrue), temp.lengths), 1:3]
temp.melt <- melt(temp)
df.expanded$alpha <- temp.melt$value
demo.afalse <- demo.afalse[c(1:3)]
demonew <- rbind(demo.afalse, df.expanded)

# Step 2: same procedure for beta, starting from the result of step 1
demonew$b <- grepl(",", demonew$beta)
demonew.btrue <- demonew[demonew$b, ]
demonew.bfalse <- demonew[!demonew$b, ]
demonew.btrue$beta <- as.character(demonew.btrue$beta)
temp <- strsplit(demonew.btrue$beta, ",")
temp.lengths <- sapply(temp, length)
df.expanded1 <- demonew.btrue[rep(row.names(demonew.btrue), temp.lengths), 1:3]
temp.melt <- melt(temp)
df.expanded1$beta <- temp.melt$value
demonew.bfalse <- demonew.bfalse[c(1:3)]
demonew1 <- rbind(df.expanded1, demonew.bfalse)
demonew1
As well as possibly being inefficient, I'm not sure this will work under all conditions, in particular when the same observation has multiple numbers in both the alpha and beta variables. I have tested it with a few examples and it looks fine, but I'm not confident about it.
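For comparison, here is a minimal base R sketch of the same idea written as a single helper that is applied to each column in turn. The helper name `expand_column` is hypothetical, and the data frame is retyped by hand as character columns rather than factors. It also assumes that a row with several values in both alpha and beta should expand to every combination, which is what expanding one column after the other produces.

```r
# Example data retyped as character columns (same values as demo)
demo <- data.frame(
  alpha  = c("6", "9", "1", "3", "3", "3", "2, 4, 7, 11"),
  beta   = c("8, 9, 10, 11", "6", "6", "8, 9", "6, 8", "1", "9"),
  option = c("apple", "pear", "apple", "pear", "lime", "apple", "lime"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output row per comma-separated value in `col`
expand_column <- function(df, col) {
  pieces <- strsplit(as.character(df[[col]]), ",", fixed = TRUE)
  # Repeat each row as many times as it has values in `col`
  out <- df[rep(seq_len(nrow(df)), lengths(pieces)), , drop = FALSE]
  # Replace the column with the individual values, minus leading spaces
  out[[col]] <- as.numeric(trimws(unlist(pieces)))
  rownames(out) <- NULL
  out
}

result <- expand_column(expand_column(demo, "alpha"), "beta")

# A row with several values in both columns yields every combination
both <- data.frame(alpha = "1, 2", beta = "3, 4", option = "apple",
                   stringsAsFactors = FALSE)
both_result <- expand_column(expand_column(both, "alpha"), "beta")
```

Note that this sketch drops a row whose cell is an empty string, because `strsplit` returns zero pieces for it. I have not checked whether that is the behaviour I would want in the real data.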
Thanks for your attention.