Combine top_n with the category Other in dplyr

I have a data frame, dat1:

       Country Count
    1      AUS     1
    2       NZ     2
    3       NZ     1
    4      USA     3
    5      AUS     1
    6      IND     2
    7      AUS     4
    8      USA     2
    9      JPN     5
    10      CN     2

First I want to sum "Count" by "Country". Then the three countries with the largest totals should be kept, with an additional row, "Others", holding the sum for all countries that are not in the top three.

Thus, the expected result:

      Country Count
    1     AUS     6
    2     JPN     5
    3     USA     5
    4  Others     7

I tried the code below but couldn’t figure out how to place the string “Others”.

    dat1 %>%
      group_by(Country) %>%
      summarise(Count = sum(Count)) %>%
      arrange(desc(Count)) %>%
      top_n(3)

This code currently gives:

      Country Count
    1     AUS     6
    2     JPN     5
    3     USA     5

Any help would be greatly appreciated.

    dat1 <- structure(list(
      Country = structure(c(1L, 5L, 5L, 6L, 1L, 3L, 1L, 6L, 4L, 2L),
                          .Label = c("AUS", "CN", "IND", "JPN", "NZ", "USA"),
                          class = "factor"),
      Count = c(1L, 2L, 1L, 3L, 1L, 2L, 4L, 2L, 5L, 2L)),
      .Names = c("Country", "Count"),
      class = "data.frame",
      row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
6 answers

Instead of top_n, this seems like a good case for the convenience function tally. It uses summarise, sum and arrange under the hood.

Then use factor to create the "Other" category. Use the levels argument to set "Other" as the last level, so that "Other" is placed last in the table (and in any subsequent plot of the result).

If Country is a factor in your source data, as it is in dat1, wrap Country[1:3] in as.character; otherwise the factor's integer codes are combined with "Other" instead of its labels.

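    library(dplyr)

    # dat1 is the data frame from the question; Country is a factor there,
    # hence the as.character() calls
    group_by(dat1, Country) %>%
      tally(Count, sort = TRUE) %>%
      group_by(Country = factor(
        c(as.character(Country[1:3]), rep("Other", n() - 3)),
        levels = c(as.character(Country[1:3]), "Other")
      )) %>%
      tally(n)
    #   Country     n
    #    (fctr) (int)
    # 1     AUS     6
    # 2     JPN     5
    # 3     USA     5
    # 4   Other     7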
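For readers without dplyr, the same factor-levels idea can be sketched in base R. This is only an illustration under my own assumptions: the question's data is rebuilt inline with character countries, and the names totals, top3 and result are not from the answer above.

```r
# Question's data, rebuilt inline (Country as character for simplicity)
dat1 <- data.frame(
  Country = c("AUS", "NZ", "NZ", "USA", "AUS", "IND", "AUS", "USA", "JPN", "CN"),
  Count   = c(1L, 2L, 1L, 3L, 1L, 2L, 4L, 2L, 5L, 2L),
  stringsAsFactors = FALSE
)

# Total per country, largest first
totals <- aggregate(Count ~ Country, data = dat1, FUN = sum)
totals <- totals[order(-totals$Count), ]
top3 <- as.character(totals$Country[1:3])

# Recode everything outside the top three as "Other"; listing "Other"
# last in `levels` keeps it at the bottom of tables and plots
totals$Country <- factor(
  ifelse(totals$Country %in% top3, as.character(totals$Country), "Other"),
  levels = c(top3, "Other")
)

# Sum again so the recoded countries collapse into a single row
result <- aggregate(Count ~ Country, data = totals, FUN = sum)
print(result)
```

Because aggregate returns its groups in factor-level order, "Other" ends up in the last row without any extra sorting step.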

We could do this in two steps: first create a sorted data.frame, then rbind the top three rows with a summary of the remaining rows:

    # dat1 is the data frame from the question
    d <- dat1 %>%
      group_by(Country) %>%
      summarise(Count = sum(Count)) %>%
      arrange(desc(Count))

    rbind(
      top_n(d, 3),
      slice(d, 4:n()) %>% summarise(Country = "other", Count = sum(Count))
    )

Output

      Country Count
       (fctr) (int)
    1     AUS     6
    2     JPN     5
    3     USA     5
    4   other     7

Here is an option using data.table. We convert the data.frame to a data.table (setDT(dat1)), group by "Country" and take the sum of "Count", then order by "Count" and rbind the first three observations with a list holding "Others" and the sum of "Count" for the remaining observations.

    library(data.table)
    setDT(dat1)[, list(Count = sum(Count)), Country][
      order(-Count),
      rbind(.SD[1:3],
            list(Country = 'Others', Count = sum(.SD[[2]][4:.N])))
    ]
    #    Country Count
    # 1:     AUS     6
    # 2:     USA     5
    # 3:     JPN     5
    # 4:  Others     7

Or using base R

    d1 <- aggregate(. ~ Country, dat1, FUN = sum)
    i1 <- order(-d1$Count)
    rbind(d1[i1, ][1:3, ],
          data.frame(Country = 'Others',
                     Count = sum(d1$Count[i1][4:nrow(d1)])))

You could also use xtabs() and manipulate the result. This is a base R answer.

    s <- sort(xtabs(Count ~ ., dat1), decreasing = TRUE)
    setNames(
      as.data.frame(as.table(c(head(s, 3), Others = sum(tail(s, -3))))),
      names(dat1)
    )
    #   Country Count
    # 1     AUS     6
    # 2     JPN     5
    # 3     USA     5
    # 4  Others     7

A function that may be useful:

    top_cases = function(v, top, other = 'other'){
      cv = class(v)
      v = as.character(v)
      v[factor(v, levels = top) %>% is.na()] = other
      if(cv == 'factor') v = factor(v, levels = c(top, other))
      v
    }

For example:

    > table(state.region)
    state.region
        Northeast         South North Central          West 
                9            16            12            13 
    > top_cases(state.region, c('South','West'), 'North') %>% table()
    .
    South  West North 
       16    13    21 

    iris %>% mutate(Species = top_cases(Species, c('setosa','versicolor')))

For those interested in the related case where every category that makes up less than a certain percentage is placed in an "Other" category, here is some code.

Here, any value below 5% goes into "Other". The "Other" row holds the summed remainder, and its label includes the number of categories that were aggregated into it.

    othernum <- nrow(sub[(sub$value < .05), ])
    sub <- subset(sub, value > .05)
    toplot <- rbind(sub,
                    c(paste("Other (", othernum, " types)", sep = ""),
                      1 - sum(sub$value)))
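The snippet above depends on a data frame that is not shown. As a self-contained sketch of the same thresholding idea, assuming a hypothetical input with a name column and a value column of shares that sum to 1 (the data, the column names and the variables shares, threshold and to_plot are all made up for illustration):

```r
# Hypothetical input: one row per category with its share of the total
shares <- data.frame(
  name  = c("AUS", "JPN", "USA", "NZ", "CN", "IND"),
  value = c(0.40, 0.30, 0.20, 0.04, 0.03, 0.03),
  stringsAsFactors = FALSE
)
threshold <- 0.05

# Flag the categories that fall below the threshold
small <- shares$value < threshold

# One summary row: the label records how many categories were merged
other_row <- data.frame(
  name  = paste0("Other (", sum(small), " types)"),
  value = sum(shares$value[small]),
  stringsAsFactors = FALSE
)

# Keep the large categories and append the summary row
to_plot <- rbind(shares[!small, ], other_row)
print(to_plot)
```

Building the extra row as a data frame, rather than with c(), is a deliberate choice in this sketch: c() applied to a label and a number returns a character vector, so the number would arrive as a string. Note also that a share exactly equal to the threshold is kept as its own category here.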
