An elegant way to remove rare factor levels from a data frame

I want to subset a data frame by a factor column, keeping only the factor levels that occur at or above a certain frequency.

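 # stringsAsFactors = TRUE keeps `factor` a factor on R >= 4.0.0, where the default became FALSE
 df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                  variable = rnorm(12),
                  stringsAsFactors = TRUE)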

This code creates a data frame:

    factor    variable
 1       a -1.55902013
 2       a  0.22355431
 3       a -1.52195456
 4       a -0.32842689
 5       a  0.85650212
 6       b  0.00962240
 7       b -0.06621508
 8       b -1.41347823
 9       b  0.08969098
 10      b  1.31565582
 11      c -1.26141417
 12      c -0.33364069

I want to drop the factor levels that occur fewer than 5 times. I wrote a for loop, and it works:

 for (i in 1:length(levels(df$factor))) {
   if (table(df$factor)[i] < 5) {
     df.new <- df[df$factor != names(table(df$factor))[i], ]
   }
 }

But is there a faster, more elegant solution?

+7
r subset
6 answers

What about

 df.new <- df[!(as.numeric(df$factor) %in% which(table(df$factor)<5)),] 
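To see why this one-liner works, it can help to split it into its steps. The sketch below assumes `df$factor` really is a factor (the integer codes are what get matched), and it swaps the random `variable` column for a fixed sequence so the result is reproducible. The names `counts`, `rare_codes`, and `codes` are only illustrative. The final `droplevels()` call is an optional extra, since subsetting alone leaves the unused level in place.

```r
# Same layout as the question's data, with a fixed `variable` column (assumption, for reproducibility)
df <- data.frame(factor = factor(c(rep("a", 5), rep("b", 5), rep("c", 2))),
                 variable = seq_len(12))

counts <- table(df$factor)        # occurrences per level, in level order
rare_codes <- which(counts < 5)   # positions of the levels seen fewer than 5 times
codes <- as.numeric(df$factor)    # integer code of each row's level

# Keep the rows whose level code is not among the rare positions
df.new <- df[!(codes %in% rare_codes), ]

# Subsetting keeps the unused level in levels(); droplevels() removes it
df.new <- droplevels(df.new)
levels(df.new$factor)
```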
+6
 require(dplyr)
 df %>% group_by(factor) %>% filter(n() >= 5)
 #   factor   variable
 #1       a  2.0769363
 #2       a  0.6187513
 #3       a  0.2426108
 #4       a -0.4279296
 #5       a  0.2270024
 #6       b -0.6839748
 #7       b -0.3285610
 #8       b  0.2625743
 #9       b -0.9532957
 #10      b  1.4526317
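One detail worth knowing with the grouped filter: the result is still grouped by `factor`, which can affect later `mutate()` or `summarise()` calls. A minimal sketch, assuming a fixed `variable` column instead of `rnorm()` so the row count is easy to check, adds `ungroup()` at the end:

```r
library(dplyr)

# Same layout as the question's data, with a fixed `variable` column (assumption, for reproducibility)
df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = seq_len(12))

df.new <- df %>%
  group_by(factor) %>%
  filter(n() >= 5) %>%   # n() is the size of the current group
  ungroup()              # drop the grouping so later verbs act on the whole table
```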
+8
 library(data.table)
 setDT(df)[, variable[.N >= 5], by = factor]
 ##     factor         V1
 ##  1:      a -0.8204684
 ##  2:      a  0.4874291
 ##  3:      a  0.7383247
 ##  4:      a  0.5757814
 ##  5:      a -0.3053884
 ##  6:      b  1.5117812
 ##  7:      b  0.3898432
 ##  8:      b -0.6212406
 ##  9:      b -2.2146999
 ## 10:      b  1.1249309
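The call above returns only the `variable` column, renamed to `V1`. If the data frame has more columns and they should all be kept under their own names, a possible variant returns `.SD` for each sufficiently large group. This is a sketch, not part of the original answer; it uses `as.data.table()` to work on a copy (whereas `setDT()` converts `df` in place) and a fixed `variable` column for reproducibility.

```r
library(data.table)

# Same layout as the question's data, with a fixed `variable` column (assumption, for reproducibility)
df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = seq_len(12))

dt <- as.data.table(df)                        # copy; leaves df untouched
dt.new <- dt[, if (.N >= 5) .SD, by = factor]  # .SD holds the group's other columns; .N is the group size
```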
+5

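Until recently, I would have agreed with the group_by + filter approach. However, with the forcats package from the tidyverse, another solution would be

 require(forcats)
 require(dplyr)
 # fct_lump_min() lumps levels seen fewer than `min` times into "Other";
 # fct_lump(n = 5) would instead keep the 5 most frequent levels
 df %>% filter(fct_lump_min(factor, min = 5) != "Other")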

We could also make it more expressive by using NA for the low-frequency category:

 df %>% filter(!is.na(fct_lump_min(factor, min = 5, other_level = NA)))
+4

Maybe join against a filtered count of the factor levels:

 library(dplyr)
 # %.% was removed from dplyr; %>% is its replacement
 common.factors <- df %>% group_by(factor) %>% tally() %>% filter(n >= 5)
 df.1 <- semi_join(df, common.factors, by = "factor")
+3

Try this with base R functions:

 lvl = as.data.frame(table(df$factor))
 colnames(lvl) = c('factor','count')
 lvl
 #   factor count
 # 1      a     5
 # 2      b     5
 # 3      c     2
 df[df$factor %in% lvl[lvl$count>=5,]$factor,]
 #    factor    variable
 # 1       a -0.01619026
 # 2       a  0.94383621
 # 3       a  0.82122120
 # 4       a  0.59390132
 # 5       a  0.91897737
 # 6       b  0.78213630
 # 7       b  0.07456498
 # 8       b -1.98935170
 # 9       b  0.61982575
 # 10      b -0.05612874
0