Return df with column values that occur more than once

Question


I have a data frame df, and I am trying to subset all rows whose value in column B occurs more than once in the data set.

I tried using table() to do this, but I am having trouble subsetting from the table:

 t<-table(df$B)

Then I try subsetting it using:

 subset(df, table(df$B)>1)

And I get an error:

"Error in x[subset & !is.na(subset)] : object of type 'closure' is not subsettable"

How can I subset my data frame using the table counts?

+8

r dataframe subset

Chris Robles Jul 01 '14 at 5:52


3 answers

mnel · Answer 1 · 2014-07-01 06:17

Here is a dplyr solution (using MrFlick's data.frame)

 library(dplyr)
 newd <- dd %>% group_by(b) %>% filter(n()>1)
 # newd
 #    a b
 # 1  1 1
 # 2  2 1
 # 3  5 4
 # 4  6 4
 # 5  7 4
 # 6  9 6
 # 7 10 6

Or using data.table

 setDT(dd)[,if(.N >1) .SD,by=b]

Or using base R

 dd[dd$b %in% unique(dd$b[duplicated(dd$b)]),]
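The base R one-liner can be unpacked step by step. The following is a minimal sketch, assuming the same dd sample data as in MrFlick's answer; the intermediate names dup_values and keep are only illustrative.

```r
# Sample data, assumed to match the dd used in MrFlick's answer
dd <- data.frame(a = 1:10, b = c(1, 1, 2, 3, 4, 4, 4, 5, 6, 6))

# duplicated() flags the second and later occurrences of each value,
# so the flagged values are exactly those that occur more than once
dup_values <- unique(dd$b[duplicated(dd$b)])

# Keep every row (including the first occurrence) whose b is one of those values
keep <- dd$b %in% dup_values
result <- dd[keep, ]
```

Note that duplicated() on its own would drop the first occurrence of each repeated value, which is why the %in% step is needed.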

Mike.Gahan · Answer 2 · 2014-07-01 05:57

May I suggest an alternative, faster way to do this with data.table?

 require(data.table) ## 1.9.2
 setDT(df)[, .N, by=B][N > 1L]$B

(or) you can chain .I (another special variable, see ?data.table), which gives the corresponding row numbers in df, along with .N as follows:

 setDT(df)[df[, .I[.N > 1L], by=B]$V1]

(or) have a look at @mnel's answer for another variation (using another special variable, .SD).

MrFlick · Answer 3 · 2014-07-01 06:08

Using table() is not the best option, because then you have to join the counts back to the original rows of the data.frame. The ave function makes it easier to calculate row-level values for different groups. For example:

 dd<-data.frame(
     a=1:10,
     b=c(1,1,2,3,4,4,4,5,6, 6)
 )

 dd[with(dd, ave(b,b,FUN=length))>1, ]
 #subset(dd, ave(b,b,FUN=length)>1) #same thing

 #     a b
 # 1   1 1
 # 2   2 1
 # 5   5 4
 # 6   6 4
 # 7   7 4
 # 9   9 6
 # 10 10 6

Here, for each level of b, it computes the length of b, which is really just the number of rows with that value, and returns that count to the corresponding row for each value. Then we use those counts to subset.
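To see what ave() contributes, it can help to look at the intermediate vector and compare it with the table() route from the question. This is a minimal sketch on the same dd sample data; counts, tab, by_ave and by_table are illustrative names, not part of the original answer.

```r
dd <- data.frame(a = 1:10, b = c(1, 1, 2, 3, 4, 4, 4, 5, 6, 6))

# ave() returns one value per row: the size of the group that row belongs to
counts <- ave(dd$b, dd$b, FUN = length)
by_ave <- dd[counts > 1, ]

# table() returns one count per distinct value, so the counts have to be
# matched back to the rows through the names of the table
tab <- table(dd$b)
by_table <- dd[dd$b %in% names(tab)[tab > 1], ]
```

Both routes select the same rows here. The table() variant relies on matching the values of b against character names, which is an assumption that holds for this sample data but may be fragile for non-integer numeric values.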
