Calculation of summary statistics on subsets of a data set [What is the equivalent of Stata "bysort" in R?]

I programmed in Stata for the past few years and switched to R about 4 months ago.

I have data in the following format:

        popname sex year age COUNTRY
 329447     AUS   f 1921  23     AUS
 329448     AUS   f 1921  24     AUS
 329449     AUS   f 1921  25     AUS
 329450     AUS   f 1921  26     AUS
 329451     AUS   f 1921  27     AUS
 329452     AUS   f 1921  28     AUS
 ...
 329532     AUS   f 1922  23     AUS
 329533     AUS   f 1922  24     AUS
 329534     AUS   f 1922  25     AUS
 ...
 297729     BLR   f 1987  59     BLR
 297730     BLR   f 1987  60     BLR
 297731     BLR   f 1987  61     BLR
 ...
 291941     BLR   m 1973  71     BLR
 291942     BLR   m 1973  72     BLR
 291993     BLR   m 1974  23     BLR

I would like to add a new summary variable, max.age, to the existing dataset. It should hold the maximum age within each subgroup defined by {popname, sex, year}, as follows:

        popname sex year age COUNTRY max.age
 329447     AUS   f 1921  23     AUS      72
 329448     AUS   f 1921  24     AUS      72
 329449     AUS   f 1921  25     AUS      72
 329450     AUS   f 1921  26     AUS      72
 329451     AUS   f 1921  27     AUS      72
 329452     AUS   f 1921  28     AUS      72
 ...
 329532     AUS   f 1922  23     AUS      75
 329533     AUS   f 1922  24     AUS      75
 329534     AUS   f 1922  25     AUS      75
 ...
 297729     BLR   f 1987  59     BLR      87
 297730     BLR   f 1987  60     BLR      87
 297731     BLR   f 1987  61     BLR      87
 ...
 291941     BLR   m 1973  71     BLR      78
 291942     BLR   m 1973  72     BLR      78
 291993     BLR   m 1974  23     BLR      78

To do this in Stata, you can use the egen command with the by command:

 by State City Day, sort: egen cnt=seq(), from(23) to(72) block(1); 

I tried to do this in R using the doBy package. Here is the code I wrote:

 IDB <- orderBy(~popname+sex+year+age, data=IDB)
 v <- lapplyBy(~sex+year, data=IDB, function(d) c(NA,max(d$age)))
 IDB$Max.age <- unlist(v)

This does not work because lapplyBy returns an aggregated dataset shorter than the original dataset (IDB).

Can someone kindly point me in the right direction on how to do the equivalent of Stata's "by" combined with "egen" in R?

thanks

4 answers

One thing you'll find in R is that there is more than one way to do something. One way is through the ave function.

 IDB$max.age <- ave(IDB$age, IDB$popname, IDB$sex, IDB$year, FUN=max) 
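As a quick check of what ave does, here is a minimal sketch on a made-up data frame (the name toy and its values are assumptions for illustration, not the asker's data). Each row receives the maximum age of its popname/sex/year group, so the result has the same length as the input:

```r
# Toy data standing in for IDB (values are made up for illustration)
toy <- data.frame(
  popname = c("AUS", "AUS", "AUS", "AUS", "AUS"),
  sex     = c("f", "f", "f", "f", "f"),
  year    = c(1921, 1921, 1921, 1922, 1922),
  age     = c(23, 24, 72, 23, 75),
  stringsAsFactors = FALSE
)

# ave() returns one value per row: the group maximum repeated within each group
toy$max.age <- ave(toy$age, toy$popname, toy$sex, toy$year, FUN = max)

print(toy)
```

Because the new column lines up row by row with the original data, no merge step is needed afterwards.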

I would recommend using ddply from the plyr package (although there are many ways to do something like this). Assuming your data frame is called dat :

 library(plyr)
 result <- ddply(dat, .(popname, sex, year), .fun = function(x) {
   x$max.age <- max(x$age, na.rm = TRUE)
   return(x)
 })

The anonymous function passed to ddply adds a column to each piece containing the maximum age for that piece.
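The split-apply-combine idea behind that call can be sketched in base R, with no extra packages. This is only an illustration of the pattern on made-up data (the values in dat are assumptions), not a description of how plyr is implemented internally:

```r
# Toy data standing in for dat (values are made up for illustration)
dat <- data.frame(
  popname = c("AUS", "AUS", "BLR", "BLR"),
  sex     = c("f", "f", "m", "m"),
  year    = c(1921, 1921, 1973, 1973),
  age     = c(23, 72, 71, 78),
  stringsAsFactors = FALSE
)

# Split: one data frame per observed popname/sex/year combination
pieces <- split(dat, list(dat$popname, dat$sex, dat$year), drop = TRUE)

# Apply: add the group maximum as a new column on each piece
pieces <- lapply(pieces, function(x) {
  x$max.age <- max(x$age, na.rm = TRUE)
  x
})

# Combine: stack the pieces back into a single data frame
result <- do.call(rbind, pieces)
rownames(result) <- NULL

print(result)
```

Note that the combined rows come back grouped by piece, so the row order can differ from the original data frame.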


I found the Stata egen documentation completely opaque when I tried to read it a couple of years ago, so I will not give you a general answer. The function used for this purpose (returning a vector of the same length from a function applied to groups) is ave() :

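 # ave() treats every argument other than x and FUN as a grouping variable,
 # so na.rm has to be set inside the function rather than passed to ave()
 dfrm$max.age <- with( dfrm, ave(age, list(popname, sex, year), FUN=function(x) max(x, na.rm=TRUE)) )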

You may receive warnings, but the operation completes successfully. Perhaps the cross-product of the grouping variables creates empty categories that are subsequently discarded. The warnings also occur with Joshua's version, and removing na.rm = TRUE does not change them:

 1: In FUN(X[[20L]], ...) : no non-missing arguments to max; returning -Inf 
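That message is consistent with empty groups: max() applied to a zero-length vector returns -Inf with exactly this warning. One possible way to avoid it, sketched here on made-up data (the values and the helper name grp are assumptions), is to build a single grouping factor with interaction(..., drop = TRUE) so that only the combinations present in the data remain. Whether the warning appears at all may depend on your R version, so treat this as a sketch rather than a required fix:

```r
# Toy data: only AUS/f and BLR/m occur, so AUS/m and BLR/f are empty combinations
dfrm <- data.frame(
  popname = c("AUS", "AUS", "BLR", "BLR"),
  sex     = c("f", "f", "m", "m"),
  age     = c(23, 72, 71, 78),
  stringsAsFactors = FALSE
)

# max() on an empty vector is the source of the message: it returns -Inf
empty.max <- suppressWarnings(max(numeric(0)))

# Build one grouping factor that keeps only the combinations present in the data
grp <- interaction(dfrm$popname, dfrm$sex, drop = TRUE)

# With no empty levels in grp, FUN is never called on an empty vector
dfrm$max.age <- ave(dfrm$age, grp, FUN = function(x) max(x, na.rm = TRUE))

print(dfrm)
```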

This is easy to do using dplyr:

 library(dplyr)
 IDB %>%
   group_by(popname, sex, year) %>%
   mutate(max.age = max(age))