Duplication of observations of a data frame, but also the replacement of certain values โ€‹โ€‹of variables in R

I am looking for some tips on some data restructuring. I collect some data using Google Forms, which are downloaded as a CSV file, and looks something like this:

# alpha beta option # 6 8, 9, 10, 11 apple # 9 6 pear # 1 6 apple # 3 8, 9 pear # 3 6, 8 lime # 3 1 apple # 2, 4, 7, 11 9 lime 

Data has two variables (alpha and beta), which are listed in each list. For most of my data, there is only one number in each variable. However, for some observations, there may be two, three, or even up to ten numbers. This is because these are answers collected using the "checkbox" option on google forms, which allows multiple answers to one survey question. Also, this may be important for some potential solutions that google forms return leading spaces in front of each of several answers.

In my real data, this happens only in a very small proportion of all observations, the above example is more concise. There are several other variables in the dataset. Here, I simply include one called option, which contains factors.

What I need to do is duplicate all the observations that contain multiple numbers in the alpha or beta variable. The number of duplicate lines must be equal to the number of numbers that exist in the alpha or beta variable. Then I need to replace the sequence of numbers in the alpha or beta variables with each number independently. This will result in the following:

 # alpha beta option # 6 8 apple # 6 9 apple # 6 10 apple # 6 11 apple # 9 6 pear # 1 6 apple # 3 8 pear # 3 9 pear # 3 6 lime # 3 8 lime # 3 1 apple # 2 9 lime # 4 9 lime # 7 9 lime # 11 9 lime 

Here is the data that reproduces the source data of the example above. I called the dataframe demo:

 demo<-structure(list(alpha = structure(c(4L, 5L, 1L, 3L, 3L, 3L, 2L), .Label = c("1","2, 4, 7, 11", "3", "6", "9"), class = "factor"), beta = structure(c(5L, 2L, 2L, 4L, 3L, 1L, 6L), .Label = c("1", "6", "6, 8", "8, 9", "8, 9, 10, 11", "9"), class = "factor"), option = structure(c(1L, 3L, 1L, 3L, 2L, 1L, 2L), .Label = c("apple", "lime", "pear"), class = "factor")), .Names = c("alpha", "beta", "option"), class = "data.frame", row.names = c(NA, -7L)) 

OK So I think I wrote code that leads in a very tight way to the new data file I'm looking for. However, there seems to be a more elegant and better way to do this.

Basically, I work on an alpha variable first. I first select observations based on whether the commas exist in the variable or not. With observations that contain commas, I then use strsplit to separate the numbers. Then I count how many numbers exist for each observation and duplicate each observation. Then I melt the broken numbers into a data frame with all the numbers in a variable called "value". I am simply replacing the alpha variable with the data in the molten value variable. Then I return data with data that does not contain commas. Then I use this df and work on the beta variable ....

Here is my solution (does it seem to work?):

 library(reshape2) demo$a<-grepl(",", demo$alpha) demo.atrue <- demo[ which(demo$a=='TRUE'), ] demo.afalse <- demo[ which(demo$a=='FALSE'), ] demo.atrue$alpha<-as.character(demo.atrue$alpha) temp<-strsplit(demo.atrue$alpha, ",") temp.lengths<-lapply(temp, length) for (i in 1:length(temp)) { df.expanded <- demo.atrue[rep(row.names(demo.atrue), temp.lengths), 1:3] } temp.melt<-melt(temp) df.expanded$alpha<-temp.melt$value demo.afalse<-demo.afalse[c(1:3)] demonew<-rbind(demo.afalse, df.expanded) demonew$b<-grepl(",", demonew$beta) demonew.btrue <- demonew[ which(demonew$b=='TRUE'), ] demonew.bfalse <- demonew[ which(demonew$b=='FALSE'), ] demonew.btrue$beta<-as.character(demonew.btrue$beta) temp<-strsplit(demonew.btrue$beta, ",") temp.lengths<-lapply(temp, length) for (i in 1:length(temp)) { df.expanded1 <- demonew.btrue[rep(row.names(demonew.btrue), temp.lengths), 1:3] } temp.melt<-melt(temp) df.expanded1$beta<-temp.melt$value demonew.bfalse<-demonew.bfalse[c(1:3)] demonew1<-rbind(df.expanded1, demonew.bfalse) demonew1 #this seems to work, but doesn't feel very efficient 

And maybe not very effective, I'm not sure if this will work in all conditions. In particular, if for the same observation there are both plurals and variables "alpha" and "beta". I tested it with a few examples, and it looks fine, but I'm not sure about that.

Thanks for attention.

+7
r reshape2 gsub grepl
source share
2 answers

You can use my cSplit function , nested twice, for example:

 cSplit(cSplit(demo, "alpha", ",", "long"), "beta", ",", "long") # alpha beta option # 1: 6 8 apple # 2: 6 9 apple # 3: 6 10 apple # 4: 6 11 apple # 5: 9 6 pear # 6: 1 6 apple # 7: 3 8 pear # 8: 3 9 pear # 9: 3 6 lime # 10: 3 8 lime # 11: 3 1 apple # 12: 2 9 lime # 13: 4 9 lime # 14: 7 9 lime # 15: 11 9 lime 

Some guidelines:

More interesting data examples. 700 rows instead of 7 (still a pretty small data set) ...

 demo <- do.call(rbind, replicate(100, demo, FALSE)) library(data.table) demo2 <- data.table(demo) 

Functions for checking ...

 ## MrFlick's fun1 <- function() { do.call(rbind, with(demo, Map(expand.grid, alpha = strsplit(alpha,", "), beta = strsplit(beta, ", "), option = option ))) } ## Mine fun2 <- function() { cSplit(cSplit(demo2, "alpha", ",", "long"), "beta", ",", "long") } ## thelatemail one-liner fun3 <- function() { do.call(rbind,do.call(Map, c(expand.grid, lapply(demo, strsplit, ", ")))) } 

Actual benchmarking ...

 library(microbenchmark) microbenchmark(MF = fun1(), AM = fun2(), TH = fun3(), times = 10) # Unit: milliseconds # expr min lq median uq max neval # MF 785.34875 789.94924 800.11046 800.93643 813.62390 10 # AM 11.54569 11.93483 12.14181 12.31329 12.93208 10 # TH 790.46069 799.68518 803.47294 827.69520 899.11219 10 
+8
source share

In fact, this should not be too bad. First, for simplicity, I'm going to convert all columns to characters to simplify later layouts

 demo[] <- lapply(demo, as.character) 

Now do the hard work. Basically, I split the alpha and beta columns on the delimiter,. Then I will use expand.grid to combine all the elements "alpha", "beta" and "option". This will take care of repeating the necessary lines and will work if both alpha and beta have multiple meanings. Finally, I will merge all the newly created lines into one large large data.frame file. Here is the code

 do.call(rbind, with(demo, Map(expand.grid, alpha = strsplit(alpha,", "), beta = strsplit(beta, ", "), option = option ))) 

What is it. He will return

  alpha beta option 1 6 8 apple 2 6 9 apple 3 6 10 apple 4 6 11 apple 5 9 6 pear 6 1 6 apple 7 3 8 pear 8 3 9 pear 9 3 6 lime 10 3 8 lime 11 3 1 apple 12 2 9 lime 13 4 9 lime 14 7 9 lime 15 11 9 lime 
+5
source share

All Articles