I programmed in Stata for the past few years and switched to R about 4 months ago.
I have data in the following format:
       popname sex year age COUNTRY
329447     AUS   f 1921  23     AUS
329448     AUS   f 1921  24     AUS
329449     AUS   f 1921  25     AUS
329450     AUS   f 1921  26     AUS
329451     AUS   f 1921  27     AUS
329452     AUS   f 1921  28     AUS
...
329532     AUS   f 1922  23     AUS
329533     AUS   f 1922  24     AUS
329534     AUS   f 1922  25     AUS
...
...   .   ..  ..  ...
297729     BLR   f 1987  59     BLR
297730     BLR   f 1987  60     BLR
297731     BLR   f 1987  61     BLR
...
291941     BLR   m 1973  71     BLR
291942     BLR   m 1973  72     BLR
291993     BLR   m 1974  23     BLR
I would like to add a new summary variable, Max.Age, to the existing dataset. It should hold the maximum age within each subgroup defined by {popname, sex, year}, repeated on every row of that subgroup, as follows:
       popname sex year age COUNTRY max.age
329447     AUS   f 1921  23     AUS      72
329448     AUS   f 1921  24     AUS      72
329449     AUS   f 1921  25     AUS      72
329450     AUS   f 1921  26     AUS      72
329451     AUS   f 1921  27     AUS      72
329452     AUS   f 1921  28     AUS      72
...
329532     AUS   f 1922  23     AUS      75
329533     AUS   f 1922  24     AUS      75
329534     AUS   f 1922  25     AUS      75
...
...   .   ..  ..  ...
297729     BLR   f 1987  59     BLR      87
297730     BLR   f 1987  60     BLR      87
297731     BLR   f 1987  61     BLR      87
...
291941     BLR   m 1973  71     BLR      78
291942     BLR   m 1973  72     BLR      78
291993     BLR   m 1974  23     BLR      78
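To make the target concrete, here is a minimal sketch of one way such a column might be built in base R with ave(). The data frame `toy` and its values are made up for illustration and only stand in for IDB; this is an assumption about a workable approach, not a tested solution for the real data.

```r
# Hypothetical toy data standing in for IDB (values are invented)
toy <- data.frame(
  popname = c("AUS", "AUS", "AUS", "AUS", "BLR", "BLR"),
  sex     = c("f", "f", "f", "f", "m", "m"),
  year    = c(1921, 1921, 1922, 1922, 1973, 1973),
  age     = c(23, 24, 23, 25, 71, 72)
)

# Build one grouping factor from the three variables; drop = TRUE keeps
# only combinations that actually occur, so max() never sees an empty group.
grp <- interaction(toy$popname, toy$sex, toy$year, drop = TRUE)

# ave() applies FUN within each group and returns a vector as long as
# its input, so the result can be assigned straight back as a column.
toy$max.age <- ave(toy$age, grp, FUN = max)

print(toy)
```

The point of ave() here is that it keeps one value per row rather than one per group, which is the shape shown in the desired output.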
In Stata, this kind of by-group computation uses the egen command with the by prefix, for example:
by State City Day, sort: egen cnt=seq(), from(23) to(72) block(1);
I tried to do this in R using the doBy package. Here is the code I wrote:
IDB <- orderBy(~popname+sex+year+age, data=IDB)
v <- lapplyBy(~sex+year, data=IDB, function(d) c(NA,max(d$age)))
IDB$Max.age <- unlist(v)
This does not work because lapplyBy returns a list with one element per group, so the unlisted result is shorter than the original dataset (IDB) and cannot be assigned as a column.
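The length mismatch can be reproduced without doBy. The sketch below uses base R's tapply() on a made-up data frame `toy` (a hypothetical stand-in for IDB), and then shows one possible way to expand the per-group result back to one value per row by indexing with each row's group key. The names `key` and `group_max` are invented for this illustration.

```r
# Hypothetical toy data standing in for IDB (values are invented)
toy <- data.frame(
  popname = c("AUS", "AUS", "AUS", "AUS", "BLR", "BLR"),
  sex     = c("f", "f", "f", "f", "m", "m"),
  year    = c(1921, 1921, 1922, 1922, 1973, 1973),
  age     = c(23, 24, 23, 25, 71, 72)
)

# One text key per row identifying its {popname, sex, year} subgroup
key <- paste(toy$popname, toy$sex, toy$year)

# tapply() aggregates: it returns one maximum per group, named by key
group_max <- tapply(toy$age, key, max)

# The aggregated result has fewer entries than the data has rows,
# which is why it cannot be assigned directly as a new column.
cat("groups:", length(group_max), " rows:", nrow(toy), "\n")

# Looking up each row's key in the named result repeats the group
# maximum on every row; as.vector() drops the names.
toy$Max.age <- as.vector(group_max[key])

print(toy)
```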
Can someone kindly point me in the right direction on how to implement the equivalent of Stata's "by | egen" in R?
thanks