Deleting rows based on invalid duplicate data in large dataset in R

Question


I want to compute a 4-day moving average over a large dataset. The problem is that some people have fewer than 4 observations, so I get an error message indicating that k <= n is not TRUE.

Is there a way to remove any person who does not have enough data in the data set?

Here is an example of how the data will look:

         Name  variable.1
    1     Kim   64.703950
    2     Kim  926.339849
    3     Kim  128.662977
    4     Kim  290.888594
    5     Kim  869.418523
    6     Bob  594.973849
    7     Bob  408.159544
    8     Bob  609.140928
    9  Joseph  496.779712
    10 Joseph  444.028668
    11 Joseph -213.375635
    12 Joseph  -76.728981
    13 Joseph  265.642784
    14   Hank  -91.646728
    15   Hank  170.209746
    16   Hank   97.889889
    17   Hank   12.069074
    18   Hank  402.361731
    19   Earl  721.941796
    20   Earl    4.823148
    21   Earl  696.299627
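To experiment with this sample, it can be rebuilt as a data frame. The following is a minimal sketch in base R; the object name `df` is an assumption (the answers use both `df` and `DF`), and the values are copied from the listing above.

```r
# Rebuild the sample data from the question (object name `df` is assumed).
df <- data.frame(
  Name = rep(c("Kim", "Bob", "Joseph", "Hank", "Earl"),
             times = c(5, 3, 5, 5, 3)),
  variable.1 = c(
    64.703950, 926.339849, 128.662977, 290.888594, 869.418523,   # Kim
    594.973849, 408.159544, 609.140928,                          # Bob
    496.779712, 444.028668, -213.375635, -76.728981, 265.642784, # Joseph
    -91.646728, 170.209746, 97.889889, 12.069074, 402.361731,    # Hank
    721.941796, 4.823148, 696.299627                             # Earl
  )
)

# Rows per person: Bob and Earl have only 3, too few for a window of 4.
counts <- table(df$Name)
print(counts)
```

Counting rows per name like this shows which people are too short for a window of 4 before any rolling function is called.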


r dataset dataframe

user3585829 May 06 '15 at 19:00


4 answers

davechilders · Answer 1 · 2015-05-06T19:03:22+0000

If your data frame is df, you can remove all names that occur fewer than 4 times with dplyr:

    library(dplyr)

    df %>%
      group_by(Name) %>%
      filter(n() >= 4)

Steven beaupré · Answer 2 · 2015-05-06T19:07:24+0000

Try:

    library(zoo)
    library(dplyr)

    df %>%
      group_by(Name) %>%
      filter(n() >= 4) %>%
      # fill = NA pads the first three rows of each group (replaces the deprecated na.pad = TRUE)
      mutate(daymean = rollmean(variable.1, 4, align = "right", fill = NA))

This keeps only the groups with at least 4 rows and then calculates the 4-day moving average of variable.1 within each group.

    #     Name variable.1  daymean
    #1     Kim   64.70395       NA
    #2     Kim  926.33985       NA
    #3     Kim  128.66298       NA
    #4     Kim  290.88859 352.6488
    #5     Kim  869.41852 553.8275
    #6  Joseph  496.77971       NA
    #7  Joseph  444.02867       NA
    #8  Joseph -213.37563       NA
    #9  Joseph  -76.72898 162.6759
    #10 Joseph  265.64278 104.8917
    #11   Hank  -91.64673       NA
    #12   Hank  170.20975       NA
    #13   Hank   97.88989       NA
    #14   Hank   12.06907  47.1305
    #15   Hank  402.36173 170.6326
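If adding zoo is not an option, the same right-aligned mean can be sketched with `stats::filter` from base R. This is an alternative under stated assumptions, not the method of the answer above: `trailing_mean` is a hypothetical helper name, and returning all NA for a series shorter than the window is a design choice made here so that short groups do not raise an error.

```r
# Hypothetical helper: right-aligned (trailing) moving average in base R.
trailing_mean <- function(x, k = 4) {
  # Assumption: a series shorter than the window yields all NA
  # instead of an error.
  if (length(x) < k) {
    return(rep(NA_real_, length(x)))
  }
  # sides = 1 averages the current value and the k - 1 values before it,
  # so the first k - 1 results are NA.
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

# Kim's five values from the question.
kim <- c(64.703950, 926.339849, 128.662977, 290.888594, 869.418523)
kim_mean <- round(trailing_mean(kim), 4)
print(kim_mean)  # NA NA NA 352.6488 553.8275

# A series with only three values gives NA throughout.
short_mean <- trailing_mean(c(594.973849, 408.159544, 609.140928))
print(short_mean)
```

Applied per person, something like `ave(df$variable.1, df$Name, FUN = trailing_mean)` should produce a column comparable to daymean. `stats::filter` is written out in full because dplyr masks `filter` once it is loaded.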

Quinn weber · Answer 3 · 2015-05-06T19:05:05+0000

You can create a second data.frame that is aggregated to the user level, with a count for each user. Then join this data.frame back onto the original by user, and subset the joined data.frame where count >= 4.
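The answer gives no code, so here is one possible reading of it in base R. It is a sketch under stated assumptions: the column name `count` and the use of `aggregate` and `merge` are choices made for this sketch, and the data is the first 13 rows of the question's sample.

```r
# First 13 rows of the sample data (Kim, Bob, Joseph).
df <- data.frame(
  Name = rep(c("Kim", "Bob", "Joseph"), times = c(5, 3, 5)),
  variable.1 = c(
    64.703950, 926.339849, 128.662977, 290.888594, 869.418523,
    594.973849, 408.159544, 609.140928,
    496.779712, 444.028668, -213.375635, -76.728981, 265.642784
  )
)

# Step 1: aggregate to one row per Name, holding that person's row count.
counts <- aggregate(variable.1 ~ Name, data = df, FUN = length)
names(counts)[2] <- "count"  # column name `count` is an assumption

# Step 2: join the counts back onto the original rows by Name.
# Note: merge() may reorder rows, so re-sort afterwards if order matters.
merged <- merge(df, counts, by = "Name")

# Step 3: keep only people with at least 4 rows.
result <- subset(merged, count >= 4)
print(result)
```

This takes more steps than the grouped filter in the other answers, but it leaves the per-person count in the result as a column, which can help when checking a large dataset.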

Brodieg · Answer 4 · 2015-05-06T19:10:48+0000

Here are two options in base R. The first uses ave, where we create a vector that holds, for each row in a group, the length of that group (ave recycles its result to fill the group):

    # >= 4 keeps groups with exactly 4 rows, which are enough for a window of 4
    subset(DF, ave(seq_along(Name), Name, FUN = length) >= 4)

The second uses table, where we count the elements in each group and use %in% to keep only the rows belonging to groups with enough elements.

    # >= 4 keeps groups with exactly 4 rows, which are enough for a window of 4
    subset(DF, Name %in% names(table(Name)[table(Name) >= 4]))

Both produce:

         Name variable.1
    1     Kim   64.70395
    2     Kim  926.33985
    3     Kim  128.66298
    4     Kim  290.88859
    5     Kim  869.41852
    9  Joseph  496.77971
    10 Joseph  444.02867
    11 Joseph -213.37563
    12 Joseph  -76.72898
    13 Joseph  265.64278
    14   Hank  -91.64673
    15   Hank  170.20975
    16   Hank   97.88989
    17   Hank   12.06907
    18   Hank  402.36173
