R: create new columns based on nested loops

I am an applied researcher working mainly with nationwide registry data, and I am making the transition from Stata to R. The dplyr package has made most of my daily data management tasks run smoothly. However, I am currently struggling to get R to generate new variables based on nested loops.

Suppose we have the following dataset of six participants born between 1990 and 1992, with their GPA measured each year over the period 2000-2004.

    * Stata
    clear all
    input id byear gpa2000 gpa2001 gpa2002 gpa2003 gpa2004
    1 1990 1.2 1.3 1.4 1.5 1.3
    2 1990 2.3 2.5 2.2 2.1 2.6
    3 1991 3.1 3.9 3.4 3.5 4.0
    4 1991 2.6 3.1 2.4 1.9 3.1
    5 1992 1.4 1.8 3.2 2.3 3.2
    6 1992 3.5 4.0 4.0 4.0 3.9
    end
    list

         +--------------------------------------------------------------+
         | id   byear   gpa2000   gpa2001   gpa2002   gpa2003   gpa2004 |
         |--------------------------------------------------------------|
      1. |  1    1990       1.2       1.3       1.4       1.5       1.3 |
      2. |  2    1990       2.3       2.5       2.2       2.1       2.6 |
      3. |  3    1991       3.1       3.9       3.4       3.5         4 |
      4. |  4    1991       2.6       3.1       2.4       1.9       3.1 |
      5. |  5    1992       1.4       1.8       3.2       2.3       3.2 |
      6. |  6    1992       3.5         4         4         4       3.9 |
         +--------------------------------------------------------------+

Or, which is the same, in R:

    df <- read.table(header=TRUE, text="id byear gpa2000 gpa2001 gpa2002 gpa2003 gpa2004
    1 1990 1.2 1.3 1.4 1.5 1.3
    2 1990 2.3 2.5 2.2 2.1 2.6
    3 1991 3.1 3.9 3.4 3.5 4.0
    4 1991 2.6 3.1 2.4 1.9 3.1
    5 1992 1.4 1.8 3.2 2.3 3.2
    6 1992 3.5 4.0 4.0 4.0 3.9
    ")

Now I would like to create three new variables that hold each participant's GPA at ages 10, 11 and 12 (gpa_age10 ... gpa_age12).

In Stata, I usually did this through nested loops:

    forval i = 10/12 {
        gen gpa_age`i' = .
        forval j = 1990/1992 {
            replace gpa_age`i' = gpa`=`j'+`i'' if byear == `j'
        }
    }

This will result in the following dataset:

         +-----------------------------------------------------------------------------------------------+
         | id   byear   gpa2000   gpa2001   gpa2002   gpa2003   gpa2004   gpa_a~10   gpa_a~11   gpa_a~12 |
         |-----------------------------------------------------------------------------------------------|
      1. |  1    1990       1.2       1.3       1.4       1.5       1.3        1.2        1.3        1.4 |
      2. |  2    1990       2.3       2.5       2.2       2.1       2.6        2.3        2.5        2.2 |
      3. |  3    1991       3.1       3.9       3.4       3.5         4        3.9        3.4        3.5 |
      4. |  4    1991       2.6       3.1       2.4       1.9       3.1        3.1        2.4        1.9 |
      5. |  5    1992       1.4       1.8       3.2       2.3       3.2        3.2        2.3        3.2 |
      6. |  6    1992       3.5         4         4         4       3.9          4          4        3.9 |
         +-----------------------------------------------------------------------------------------------+

I understand that there may be no direct translation of this Stata code to R, but what is the best way to replicate these results in R?

2 answers

You can reshape the data.frame into a long form where each row represents one year for one student, using the reshape2 package. Calculating age then becomes trivial. Here is the complete code for this task, assuming the data.frame above is stored in a variable called dat:

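    library(reshape2)  # melt(), dcast()
    library(dplyr)     # %>%, mutate(), select()

    # One row per student and year
    mdat <- melt(dat, id.vars=c('id', 'byear'), value.name='gpa')

    # Assign the result back to mdat so that the age column is kept
    mdat <- mdat %>%
      mutate(year=as.numeric(gsub('gpa', '', variable))) %>%
      select(id, byear, year, gpa) %>%
      mutate(age=year-byear)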

You can then get a wide data.frame with one column per age by casting the molten data.frame:

    dcast(mdat, id + byear ~ age, value.var='gpa')
    >  id byear   8   9  10  11  12  13  14
    >   1  1990  NA  NA 1.2 1.3 1.4 1.5 1.3
    >   2  1990  NA  NA 2.3 2.5 2.2 2.1 2.6
    >   3  1991  NA 3.1 3.9 3.4 3.5 4.0  NA
    >   4  1991  NA 2.6 3.1 2.4 1.9 3.1  NA
    >   5  1992 1.4 1.8 3.2 2.3 3.2  NA  NA
    >   6  1992 3.5 4.0 4.0 4.0 3.9  NA  NA
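
If the goal is exactly the gpa_age10 ... gpa_age12 columns from the question, the same long-then-wide idea can also be sketched with base R only. This is an illustration under assumptions, not part of the answer above: base reshape() is swapped in for reshape2, and the names gpa_cols, long, keep, wide and result are made up for this sketch.

```r
# Example data from the question
dat <- read.table(header = TRUE, text = "id byear gpa2000 gpa2001 gpa2002 gpa2003 gpa2004
1 1990 1.2 1.3 1.4 1.5 1.3
2 1990 2.3 2.5 2.2 2.1 2.6
3 1991 3.1 3.9 3.4 3.5 4.0
4 1991 2.6 3.1 2.4 1.9 3.1
5 1992 1.4 1.8 3.2 2.3 3.2
6 1992 3.5 4.0 4.0 4.0 3.9
")

# Yearly GPA columns (gpa2000 ... gpa2004)
gpa_cols <- grep("^gpa[0-9]{4}$", names(dat), value = TRUE)

# Long form: one row per student and year (columns are stacked one after another)
long <- data.frame(
  id    = rep(dat$id, times = length(gpa_cols)),
  byear = rep(dat$byear, times = length(gpa_cols)),
  year  = rep(as.integer(sub("^gpa", "", gpa_cols)), each = nrow(dat)),
  gpa   = unlist(dat[gpa_cols], use.names = FALSE)
)
long$age <- long$year - long$byear

# Keep only the ages of interest, then spread them into gpa_age10 ... gpa_age12
keep <- long[long$age %in% 10:12, c("id", "age", "gpa")]
wide <- reshape(keep, idvar = "id", timevar = "age",
                direction = "wide", sep = "_age")

# Attach the new columns to the original data
result <- merge(dat, wide, by = "id")
print(result)
```

Filtering on age before going wide keeps the output limited to the three requested columns instead of one column for every age that occurs in the data.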

I know the question has already been answered perfectly well by @cr1msonB1ade, but to show the OP a nested for-loop version in R that matches the posted Stata code (the data.frame is called gpadf here):

    for (i in 10:12) {
      for (j in 1990:1992) {
        gpadf[[paste0("gpa_age", i)]][gpadf$byear==j] <-
          gpadf[[paste0("gpa", j+i)]][gpadf$byear==j]
      }
    }
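
The inner loop over birth years can probably be dropped by looking up, for every row at once, which yearly column holds the GPA at a given age. The following is a minimal sketch of that idea using base R matrix indexing. It assumes the yearly columns are named gpa followed by a four-digit year, and the helper names gpa_cols, years, gpa_mat and col_idx are made up for this example.

```r
# Example data from the question
gpadf <- read.table(header = TRUE, text = "id byear gpa2000 gpa2001 gpa2002 gpa2003 gpa2004
1 1990 1.2 1.3 1.4 1.5 1.3
2 1990 2.3 2.5 2.2 2.1 2.6
3 1991 3.1 3.9 3.4 3.5 4.0
4 1991 2.6 3.1 2.4 1.9 3.1
5 1992 1.4 1.8 3.2 2.3 3.2
6 1992 3.5 4.0 4.0 4.0 3.9
")

# Yearly GPA columns and the calendar years they refer to
gpa_cols <- grep("^gpa[0-9]{4}$", names(gpadf), value = TRUE)
years    <- as.integer(sub("^gpa", "", gpa_cols))
gpa_mat  <- as.matrix(gpadf[gpa_cols])

for (age in 10:12) {
  # For each row, the position of the column for year byear + age
  # (NA if that year is not among the yearly columns)
  col_idx <- match(gpadf$byear + age, years)
  # A two-column (row, column) index matrix picks one cell per row
  gpadf[[paste0("gpa_age", age)]] <-
    gpa_mat[cbind(seq_len(nrow(gpadf)), col_idx)]
}
print(gpadf)
```

Because the birth years are read from the data rather than hard-coded as 1990:1992, this version should not need editing when further cohorts are added, as long as the matching yearly columns exist.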