Deleting duplicate rows depending on a factor

Question

I want to remove duplicate rows from a data frame, stratified by a factor and according to different conditions, such as highest mean or sd.

Some data; a is the factor and the identifier for the rows.

    set.seed(13654)
    a <- sort(c(1,1,4,1,2,3,2,3,1,5))
    b <- matrix(runif(100,min = 6,max = 14),nrow = 10)
    c <- data.frame(a,b)

For example, I want to reduce the data set to the rows with the highest mean within each level of a.

    # calculate means per row
    gr <- cbind(a,M=rowMeans(c[,-1]))
    # get rows stratified by a with highest mean:
    gr1 <- aggregate(M~a,gr,which.max)
    gr1
      a M
    1 1 3
    2 2 2
    3 3 1
    4 4 1
    5 5 1

Thus, the third row of factor level 1, the second row of factor level 2, ... should be included in the new data frame. I want to avoid loops. What I tried was to split the data and then use lapply, but so far it has not worked.

    cl <- split(c,a)
    # this does not work: it does not select the correct rows
    lapply(cl, "[", gr1, )

My ultimate goal is this function:

 remove.dupl <- function(data,factor,method=c(highest.mean,highest.sd,lowest.sd,...))

Can you provide some hints or a solution for my problem? Following my workflow, I need a how-to on correctly using "[" with lapply to select a different row from each element of the data list.

+7

r duplicates

Jimbou Dec 23 '15 at 12:01

2 answers

Try the by() function:

    set.seed(13654)
    a <- sort(c(1,1,4,1,2,3,2,3,1,5))
    b <- matrix(runif(100,min = 6,max = 14),nrow = 10)
    c <- data.frame(a,b)

    # just replicating your example, you could define other functions here
    myfun <- function(x) which.max(rowMeans(x))
    # use by() to select rows, based on myfun()
    d <- by(data = c, INDICES = c$a, function(x) x[myfun(x), ])
    # turn result of by() function into a data frame
    d <- do.call(rbind, d)
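As for the "[" with lapply part of the question: lapply hands the same extra arguments to every list element, so each group receives the whole of gr1 as its row index instead of its own position. A minimal sketch of one way around this, assuming the data frame is called cc and using Map() to walk over the groups and their indices in parallel (the names cl, idx, picked and res are only illustrative):

```r
set.seed(13654)
a  <- sort(c(1, 1, 4, 1, 2, 3, 2, 3, 1, 5))
b  <- matrix(runif(100, min = 6, max = 14), nrow = 10)
cc <- data.frame(a, b)

# split into one data frame per level of a
cl <- split(cc, cc$a)

# within-group position of the row with the highest mean (one index per group)
idx <- lapply(cl, function(d) which.max(rowMeans(d[, -1])))

# Map() pairs group k with index k, so each group is subset by its own index
picked <- Map(function(d, i) d[i, , drop = FALSE], cl, idx)

# bind the one-row data frames back together
res <- do.call(rbind, picked)
res
```

Swapping which.max(rowMeans(...)) for another per-group rule changes the selection criterion without touching the Map() step.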

+4

Han de vries Dec 23 '15 at 12:27

Jaap · Accepted Answer · 2015-12-23T12:24:43+0000

Using the data.table package, I would approach it as follows:

    library(data.table)
    # method 1:
    setDT(cc)[, `:=` (rn = 1:.N, wm = which.max(rowMeans(.SD))), a][rn==wm]
    # method 2:
    setDT(cc)[, wm := frank(1/rowMeans(.SD), ties.method="first"), a][wm==1]

which gives:

       a        X1        X2        X3        X4        X5        X6        X7        X8       X9       X10 wm rn
    1: 1 13.946254  7.302729  9.406389  8.924367  8.129423 10.174735  6.547805 11.618872 12.84100  9.494790  3  3
    2: 2 13.606555 12.798149 11.261258 12.991822 12.875935 11.199411  8.551149 10.377451 13.63219 13.643163  2  2
    3: 3  6.820769 13.748507 11.630297 11.559873  6.196406  8.925419 11.230415 10.584249 10.41442  6.821673  1  1
    4: 4  8.418767 10.673998  6.693021 11.101287  7.855519  9.106210 12.279536  6.925023  6.92334 10.279204  1  1
    5: 5 11.529072  7.940031 10.746172  8.535466 13.703122 12.294424 11.362498 11.256843 13.95535 13.264835  1  1

In base R, you can do:

    cc$rm <- apply(cc[,-1], 1, mean)
    cc$wm <- ave(cc$rm, cc$a, FUN = function(x) max(x)==x)
    cc[cc$wm == 1,]

which gives:

       a        X1        X2        X3        X4        X5        X6        X7        X8       X9       X10        rm wm
    3  1 13.946254  7.302729  9.406389  8.924367  8.129423 10.174735  6.547805 11.618872 12.84100  9.494790  9.838637  1
    6  2 13.606555 12.798149 11.261258 12.991822 12.875935 11.199411  8.551149 10.377451 13.63219 13.643163 12.093708  1
    7  3  6.820769 13.748507 11.630297 11.559873  6.196406  8.925419 11.230415 10.584249 10.41442  6.821673  9.793203  1
    9  4  8.418767 10.673998  6.693021 11.101287  7.855519  9.106210 12.279536  6.925023  6.92334 10.279204  9.025591  1
    10 5 11.529072  7.940031 10.746172  8.535466 13.703122 12.294424 11.362498 11.256843 13.95535 13.264835 11.458781  1

In response to your comment: alternatively, you can use the rank function inside ave, which still returns a single row per group when the maximum is tied:

    # duplicate the row for which 'max(x)==x' for the first group
    cc <- rbind(cc,cc[3,])
    cc$wm2 <- ave(cc$rm, cc$a, FUN = function(x) rank(-x, ties.method = "first"))
    cc[cc$wm2 == 1,]

which gives:

       a        X1        X2        X3        X4        X5        X6        X7        X8       X9       X10        rm wm wm2
    3  1 13.946254  7.302729  9.406389  8.924367  8.129423 10.174735  6.547805 11.618872 12.84100  9.494790  9.838637  1   1
    6  2 13.606555 12.798149 11.261258 12.991822 12.875935 11.199411  8.551149 10.377451 13.63219 13.643163 12.093708  1   1
    7  3  6.820769 13.748507 11.630297 11.559873  6.196406  8.925419 11.230415 10.584249 10.41442  6.821673  9.793203  1   1
    9  4  8.418767 10.673998  6.693021 11.101287  7.855519  9.106210 12.279536  6.925023  6.92334 10.279204  9.025591  1   1
    10 5 11.529072  7.940031 10.746172  8.535466 13.703122 12.294424 11.362498 11.256843 13.95535 13.264835 11.458781  1   1

NOTE: I renamed the data frame to cc, since it is better not to use the name of a function (c) as the name of your data frame.
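Building on the ave/rank idea, the remove.dupl function sketched in the question could look roughly like the following. This is only a sketch under assumptions: the method labels are taken from the question's signature and passed as strings, the grouping column is given by name, all other columns are assumed numeric, and the internal names vals, score and rk are made up for illustration.

```r
# Hypothetical helper: keep one row per level of `factor`, chosen by `method`.
remove.dupl <- function(data, factor,
                        method = c("highest.mean", "highest.sd", "lowest.sd")) {
  method <- match.arg(method)
  # all columns except the grouping column (assumed numeric)
  vals <- data[, setdiff(names(data), factor), drop = FALSE]
  # one score per row; the row with the largest score in each group is kept
  score <- switch(method,
    highest.mean = rowMeans(vals),
    highest.sd   = apply(vals, 1, sd),
    lowest.sd    = -apply(vals, 1, sd)
  )
  # rank within each group; ties.method = "first" keeps exactly one row per group
  rk <- ave(score, data[[factor]],
            FUN = function(x) rank(-x, ties.method = "first"))
  data[rk == 1, , drop = FALSE]
}

# usage with the example data from the question
set.seed(13654)
a  <- sort(c(1, 1, 4, 1, 2, 3, 2, 3, 1, 5))
b  <- matrix(runif(100, min = 6, max = 14), nrow = 10)
cc <- data.frame(a, b)

out <- remove.dupl(cc, "a", "highest.mean")
out
```

Expressing every method as a score to maximise keeps the selection step identical for all methods; a new criterion only needs one more branch in switch().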
