Reshape R data with one row per user record into one row per user

I'm new to R, so apologies in advance, and thank you for your help.

I would like to analyze the data from an experiment.

The data comes in long format and needs to be converted to wide, but I can't work out exactly how to do it. Most melt/cast and reshape examples deal with much simpler data frames.

Each time a subject answers a question in the experiment, their identifier, location, age and gender are recorded on one line, followed by their response to that question. Subjects can answer any number of questions in the experiment, and different subjects can answer different questions (this makes things complicated, but it has to be this way).

Raw data looks something like this:

    User_id, location, age, gender, Item, Resp
    1, CA, 22, M, A, 1
    1, CA, 22, M, B, -1
    1, CA, 22, M, C, -1
    1, CA, 22, M, D, 1
    1, CA, 22, M, E, -1
    2, MD, 27, F, A, -1
    2, MD, 27, F, B, 1
    2, MD, 27, F, C, 1
    2, MD, 27, F, E, 1
    2, MD, 27, F, G, -1
    2, MD, 27, F, H, -1

I would like to reshape this data so that each user is on a single line, like this:

    User_id, location, age, gender, A, B, C, D, E, F, G, H
    1, CA, 22, M, 1, -1, -1, 1, -1, 0, 0, 0
    2, MD, 27, F, -1, 1, 1, 1, 0, 1, -1, -1

I think it's just a matter of finding the right reshape formula, but I have been at it for a couple of hours and can't get what I want. It also seems that most of the examples do not have repeated demographic data, so they can be pivoted more easily. Apologies if I have missed something simple.

r reshape
3 answers

Using data.table, you can use dcast:

    library(data.table)
    # dt is the question's data as a data.table, e.g. dt <- as.data.table(df)
    dcast(dt, User_id + location + age ~ Item, value.var = "Resp", fill = 0L)
    #    User_id location age  A  B  C  D  E  G  H
    # 1:       1       CA  22  1 -1 -1  1 -1  0  0
    # 2:       2       MD  27 -1  1  1  0  1 -1 -1

There's a package called tidyr, which makes melting and reshaping data much easier. In your case, you can use tidyr::spread directly:

    library(tidyr)
    result = spread(df, Item, Resp)

This, however, fills the missing entries with NA:

      User_id location age gender  A  B  C  D  E  G  H
    1       1       CA  22      M  1 -1 -1  1 -1 NA NA
    2       2       MD  27      F -1  1  1 NA  1 -1 -1

You can fix this by replacing them:

    result[is.na(result)] = 0
    result
    #   User_id location age gender  A  B  C  D  E  G  H
    # 1       1       CA  22      M  1 -1 -1  1 -1  0  0
    # 2       2       MD  27      F -1  1  1  0  1 -1 -1

... or using the fill argument:

 result = spread(df, Item, Resp, fill = 0) 
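In newer tidyr releases, spread is superseded by pivot_wider. The following is a minimal sketch of the same reshape, assuming tidyr 1.1.0 or later (where values_fill accepts a single value). The data frame here is a hand-typed stand-in for the question's data, and the name wide is only illustrative.

```r
library(tidyr)

# Stand-in for the question's long-format data (assumed to match the raw data).
df <- data.frame(
  User_id  = c(rep(1L, 5), rep(2L, 6)),
  location = c(rep("CA", 5), rep("MD", 6)),
  age      = c(rep(22L, 5), rep(27L, 6)),
  gender   = c(rep("M", 5), rep("F", 6)),
  Item     = c("A", "B", "C", "D", "E", "A", "B", "C", "E", "G", "H"),
  Resp     = c(1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1)
)

# One row per user; every column not named below is treated as an identifier.
# values_fill = 0 replaces the cells that would otherwise be NA.
wide <- pivot_wider(df, names_from = Item, values_from = Resp, values_fill = 0)
wide
```

The result is a tibble with one row per user and one column per item that appears in the data, so there is still no F column.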

For completeness, the other direction (for example, reproducing the original data.frame) works via gather (an operation usually called “melting”):

 gather(result, Item, Resp, A : H) 

The last argument here tells gather which columns to gather (and it supports the concise range syntax).
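The reverse step can also be sketched with pivot_longer, which supersedes gather in newer tidyr releases. This assumes tidyr 1.0.0 or later; the wide data frame is typed in by hand to mirror the zero-filled result, and the names long and answered are only illustrative.

```r
library(tidyr)

# Hand-typed stand-in for the zero-filled wide result.
wide <- data.frame(
  User_id = 1:2, location = c("CA", "MD"), age = c(22L, 27L),
  gender = c("M", "F"),
  A = c(1, -1), B = c(-1, 1), C = c(-1, 1), D = c(1, 0),
  E = c(-1, 1), G = c(0, -1), H = c(0, -1)
)

# Stack columns A through H back into Item/Resp pairs.
long <- pivot_longer(wide, cols = A:H, names_to = "Item", values_to = "Resp")

# The zero-filled cells come back as rows too. Dropping them only recovers
# the original rows if 0 is never a genuine response, as in this data.
answered <- long[long$Resp != 0, ]
```

Because missing answers were filled with 0, the round trip yields one row per user and item (14 rows here) until the zero rows are filtered out again.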


There is always the base R option, stats::reshape:

    (newdf <- reshape(df, direction = "wide", timevar = "Item", idvar = names(df)[1:4]))
    #   User_id location age gender Resp. A Resp. B Resp. C Resp. D Resp. E Resp. G Resp. H
    # 1       1       CA  22      M       1      -1      -1       1      -1      NA      NA
    # 6       2       MD  27      F      -1       1       1      NA       1      -1      -1

Missing values are filled with NA in reshape(), and the names are not what we want, so a little more work is needed. Here we can change the names and replace NA with zero in a single line to get the desired result.

    replace(setNames(newdf, sub(".* ", "", names(newdf))), is.na(newdf), 0)
    #   User_id location age gender  A  B  C  D  E  G  H
    # 1       1       CA  22      M  1 -1 -1  1 -1  0  0
    # 6       2       MD  27      F -1  1  1  0  1 -1 -1

Of course, the code would be more legible if we split it into two separate lines. Also note that there is no F in Item in your source data, which is why this output differs from yours.
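A two-step version might look like the sketch below. It uses a hand-typed stand-in for the data without the leading spaces in the factor labels, so reshape() produces names such as Resp.A and the prefix is stripped with a slightly different pattern than the one-liner above. The name wide is only illustrative.

```r
# Stand-in for the question's data (no leading spaces in the labels).
df <- data.frame(
  User_id  = c(rep(1L, 5), rep(2L, 6)),
  location = c(rep("CA", 5), rep("MD", 6)),
  age      = c(rep(22L, 5), rep(27L, 6)),
  gender   = c(rep("M", 5), rep("F", 6)),
  Item     = c("A", "B", "C", "D", "E", "A", "B", "C", "E", "G", "H"),
  Resp     = c(1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1)
)

wide <- reshape(df, direction = "wide", timevar = "Item",
                idvar = c("User_id", "location", "age", "gender"))

# Step 1: strip the "Resp." prefix that reshape() adds to the item columns.
names(wide) <- sub("^Resp\\.", "", names(wide))

# Step 2: replace the NA cells (unanswered items) with 0.
wide[is.na(wide)] <- 0
wide
```

The row names (1 and 6) still point at the first long-format row of each user; rownames(wide) <- NULL resets them if that matters.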

Data:

    df <- structure(list(
      User_id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
      location = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
        .Label = c(" CA", " MD"), class = "factor"),
      age = c(22L, 22L, 22L, 22L, 22L, 27L, 27L, 27L, 27L, 27L, 27L),
      gender = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L),
        .Label = c(" F", " M"), class = "factor"),
      Item = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 5L, 6L, 7L),
        .Label = c(" A", " B", " C", " D", " E", " G", " H"), class = "factor"),
      Resp = c(1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1)),
      .Names = c("User_id", "location", "age", "gender", "Item", "Resp"),
      class = "data.frame", row.names = c(NA, -11L))
