Calculating the most frequent category level using plyr

I would like to calculate the most frequent level by category with plyr, using the code below. The data frame b shows the desired result. Why does c$mlevels have a numeric value?

    require(plyr)
    set.seed(0)
    a <- data.frame(cat=round(runif(100, 1, 3)),
                    levels=factor(round(runif(100, 1, 10))))
    mode <- function(x) names(table(x))[which.max(table(x))]
    b <- data.frame(cat=1:3,
                    mlevels=c(mode(a$levels[a$cat==1]),
                              mode(a$levels[a$cat==2]),
                              mode(a$levels[a$cat==3])))
    c <- ddply(a, .(cat), summarise, mlevels=mode(levels))
2 answers

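When you use summarise, plyr seems to find the mode function in base before it “sees” the one you declared in the global environment. base::mode returns the storage mode of its argument, which for a factor is "numeric", hence the value in c$mlevels.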

We can verify this with Hadley's handy pryr package. You can install it with the following commands:

    library(devtools)
    install_github("hadley/pryr")  # current devtools expects "user/repo"
    require(pryr)
    require(plyr)
    c <- ddply(a, .(cat), summarise, print(where("mode")))
    # <environment: namespace:base>
    # <environment: namespace:base>
    # <environment: namespace:base>
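The base function that gets picked up here reports the storage mode of an object, not its most frequent value. A minimal sketch in base R (the factor f is made up for illustration) shows what it returns for a factor:

```r
# A small factor standing in for the `levels` column (illustrative data).
f <- factor(c(3, 7, 7))

# base::mode() describes how the object is stored, not which value is most common.
storage <- base::mode(f)

storage   # "numeric"
class(f)  # "factor"
```

This would explain why every row of c$mlevels ends up with the same value.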

In short, it does not see your mode function. There are two alternatives. The first is what @AnandaMahto suggested; I would do the same and advise you to stick with that. The other is to not use summarise and instead pass an anonymous function, function(x), so that the mode function in your global environment is found.

    c <- ddply(a, .(cat), function(x) mode(x$levels))
    #   cat V1
    # 1   1  6
    # 2   2  5
    # 3   3  9

Why does it work?

    c <- ddply(a, .(cat), function(x) print(where("mode")))
    # <environment: R_GlobalEnv>
    # <environment: R_GlobalEnv>
    # <environment: R_GlobalEnv>

Because, as you can see above, it finds your function, which lives in the global environment.

    > mode # your function
    # function(x)
    # names(table(x))[which.max(table(x))]
    > environment(mode) # where it sits
    # <environment: R_GlobalEnv>

Unlike:

    > base::mode # base mode function
    # function (x)
    # {
    #     some lines of code to compute mode
    # }
    # <bytecode: 0x7fa2f2bff878>
    # <environment: namespace:base>
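The difference in lookup can be reproduced in base R without plyr. The sketch below assumes, as the where() output suggests, that the expression given to summarise is evaluated in a frame that inherits from a package namespace; pkg_like is a hypothetical stand-in for such a frame, not plyr's actual internals.

```r
# A user-defined function that masks base::mode in the current environment.
mode <- function(x) names(table(x))[which.max(table(x))]
here <- environment()

# Hypothetical stand-in for a frame inside a package: its parent is the
# base namespace, so base is searched before the global environment.
pkg_like <- new.env(parent = asNamespace("base"))

x <- factor(c("a", "b", "b"))

# Same expression, two different enclosing environments.
from_here <- eval(quote(mode(x)), list(x = x), here)      # user-defined mode
from_pkg  <- eval(quote(mode(x)), list(x = x), pkg_like)  # base::mode

from_here  # "b"
from_pkg   # "numeric"
```

The expression is identical in both calls; only the environment in which the name mode is resolved changes.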

Here's Hadley's awesome wiki page on environments, if you are interested in reading or exploring further.


Your example uses only the names of existing functions: levels, cat and mode. As a rule, this does not cause a big problem. For example, calling a data.frame "df" does not break R's df() function. But it almost always leads to more ambiguous or confusing code, and in this case it made things break. Arun's answer perfectly shows why.
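A short base R sketch of both situations (the data and the replacement function are made up for illustration): a non-function object named df leaves stats::df() reachable, because R skips non-function objects when resolving a call, whereas a function named mode does mask base::mode.

```r
# A data frame named `df` does not hide the F-density function stats::df().
df <- data.frame(x = c(1, 2, 3))
density_at_one <- df(1, df1 = 2, df2 = 3)   # still calls stats::df
is.numeric(density_at_one)                  # TRUE

# A *function* named `mode`, however, masks base::mode where it is visible.
mode <- function(x) "masked"
masked_result <- mode(1)          # "masked"
base_result   <- base::mode(1)    # "numeric"

rm(mode)  # remove the mask again
```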

You can easily fix your problem by renaming your mode function. In the example below, I simplified it a bit in addition to renaming, and it works as you expected.

    Mode <- function(x) names(which.max(table(x)))
    ddply(a, .(cat), summarise, mlevels=Mode(levels))
    #   cat mlevels
    # 1   1       6
    # 2   2       5
    # 3   3       9
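As a quick check that the simplified version behaves like the original, here is a small sketch on made-up data (mode_original is just the question's function under a non-clashing name):

```r
# The function from the question, renamed to avoid clashing with base::mode.
mode_original <- function(x) names(table(x))[which.max(table(x))]

# The simplified version: which.max() keeps the name of the winning cell.
Mode <- function(x) names(which.max(table(x)))

x <- factor(c(2, 5, 5, 9))

Mode(x)                               # "5"
identical(Mode(x), mode_original(x))  # TRUE
```

Both versions return the first level with the highest count when there are ties, since which.max() picks the first maximum.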

Of course, there is a very cumbersome workaround: use get and specify where to look for the function.

    > mode <- function(x) names(table(x))[which.max(table(x))]
    > ddply(a, .(cat), summarise, mlevels = get("mode", ".GlobalEnv")(levels))
      cat mlevels
    1   1       6
    2   2       5
    3   3       9