Choose between duplicate data in a data frame

Question

I previously asked a question about retrieving duplicate rows from a data frame. Now I need to run a script to decide which of these duplicates to keep in my final data set.

Duplicate records in this dataset have the same “Assay” and “Sample” values. Here are the first 10 lines of the new dataset I'm working with, which contains my duplicate records:

        Assay     Sample  Genotype  Data
    1   CCT6-002  1486    A         1
    2   CCT6-002  1486    G         0
    3   CCT6-002  1997    G         0
    4   CCT6-002  1997    NA        NA
    5   CCT6-002  0050    G         0
    6   CCT6-002  0050    G         0
    7   CCT6-015  0082    G         0
    8   CCT6-015  0082    T         1
    9   CCT6-015  0121    G         0
    10  CCT6-015  0121    NA        NA

I would like to run a script that sorts these sets of duplicates into 4 bins based on the value of "Data", which can be either 1, 0, or NA:

    1) All values for 'Data' are NA
    2) All values for 'Data' are identical, no NA
    3) At least 1 value for 'Data' is not identical, no NA
    4) At least 1 value for 'Data' is not identical, at least one is NA

The expected result from the above data will look as follows:

    Set 1
    Null

    Set 2
    5   CCT6-002  0050  G   0
    6   CCT6-002  0050  G   0

    Set 3
    1   CCT6-002  1486  A   1
    2   CCT6-002  1486  G   0
    7   CCT6-015  0082  G   0
    8   CCT6-015  0082  T   1

    Set 4
    3   CCT6-002  1997  G   0
    4   CCT6-002  1997  NA  NA
    9   CCT6-015  0121  G   0
    10  CCT6-015  0121  NA  NA

There are cases when there are more than two “duplicated” data points in this dataset. I'm not even sure where to start with this as I'm new to R.

EDIT: with expected data.

+4

r duplicates dataframe binning

Sam globus Oct 20 '11 at 20:29

2 answers

You've asked a question that borders on asking others to do all your work for you. A question about one specific piece of this project would be more likely to attract an answer. The part you are struggling with, the part that keeps you from getting started, is a very basic programming skill: the ability to break your problem down into small, specific steps, solve each one individually, and then combine them again to solve your original problem.

This skill is also very difficult to learn. But you have made a good start! You have clearly specified the four groups that your data may fall into:

All values for "Data" are NA
All values for "Data" are identical, no NA
At least 1 value for "Data" is not identical, no NA
At least 1 value for "Data" is not identical, at least one of them is NA

Now you need to think about how, given just one subset of your data, you could determine in R which group (1-4) it belongs to. Below is a sketch of some tools that may be useful for this. Create several subsets and experiment in the console until you are comfortable identifying each group:

(1) Are all values of datSub$Data NAs?

Tools: all and is.na

(2) Only one unique value, no NAs?

Tools: length, unique, is.na, any

(3) More than one unique value, no NAs?

Tools: length, unique, any, is.na

(4) More than one unique value, at least one NA?

Tools: length, unique, any, is.na

You may not need all of these functions, but each of them is potentially useful.
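As a minimal sketch of how the four checks above might be combined, the hypothetical helper below (the name classify_data is an assumption, not part of the original answer) takes the Data values of one subset and returns its group number using only all, any, is.na, length and unique:

```r
# Hypothetical helper (not from the original answer): returns the group
# number 1-4 for one set of duplicates, given its Data values.
classify_data <- function(v) {
  if (all(is.na(v))) {
    1L   # (1) every value is NA
  } else if (!any(is.na(v)) && length(unique(v)) == 1) {
    2L   # (2) one unique value, no NA
  } else if (!any(is.na(v))) {
    3L   # (3) more than one unique value, no NA
  } else {
    4L   # (4) at least one NA, but not all values are NA
  }
}

classify_data(c(NA, NA))  # 1
classify_data(c(0, 0))    # 2
classify_data(c(1, 0))    # 3
classify_data(c(0, NA))   # 4
```

The order of the tests matters: checking for "all NA" first means the later branches can assume at least one real value is present.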

Once you know how to determine which group a particular subset belongs to, you are ready to wrap that code in a function. My suggestion is to create a new column with a value of 1-4, depending on the group the subset belongs to:

    myFun <- function(x){
      if (...){
        x$grp <- 1
      }
      if (...){
        x$grp <- 2
      }
      #etc.
      return(x)
    }

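Then use ddply (from the plyr package) to apply this function to each subset of your data. Duplicates share both Assay and Sample, so group on both, and assign the result so that the new column is kept:

    dat <- ddply(dat, .(Assay, Sample), .fun = myFun)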

And finally, split this data frame on its new grp variable:

 split(dat,dat$grp)
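For readers who want to see the whole sketch run end to end, here is one possible version. It is an illustration under stated assumptions, not the answer's own code: it swaps ddply for base R's ave() so that no package is needed, it rebuilds the question's ten rows by hand, and grp_of is a hypothetical stand-in for the conditions left open in myFun.

```r
# The ten example rows from the question (Sample kept as character so
# that leading zeros survive).
dat <- data.frame(
  Assay    = rep(c("CCT6-002", "CCT6-015"), times = c(6, 4)),
  Sample   = rep(c("1486", "1997", "0050", "0082", "0121"), each = 2),
  Genotype = c("A", "G", "G", NA, "G", "G", "G", "T", "G", NA),
  Data     = c(1, 0, 0, NA, 0, 0, 0, 1, 0, NA),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the conditions inside myFun.
grp_of <- function(v) {
  if (all(is.na(v))) {
    1
  } else if (!any(is.na(v)) && length(unique(v)) == 1) {
    2
  } else if (!any(is.na(v))) {
    3
  } else {
    4
  }
}

# ave() applies grp_of to Data within each Assay/Sample combination and
# recycles the single result to every row of that combination.
dat$grp <- ave(dat$Data, dat$Assay, dat$Sample, FUN = grp_of)

# One data frame per group that actually occurs in the data.
sets <- split(dat, dat$grp)
names(sets)  # "2" "3" "4"
```

Note that split() only returns groups that occur, so group 1 is simply absent here rather than present as an empty data frame.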

I hope this general sketch helps you get started. You will run into problems, though. Everyone does. If you hit specific problems along the way, feel free to ask another question.

In fact, I now see that John has posted an answer along the lines of my sketch. I will post this answer anyway, in the hope that it helps you analyze future problems.

+4

joran Oct 21 '11 at 0:06

John colby · Accepted Answer · 2011-10-21T00:03:20+0000

This should be a good start. Depending on how long your dataset is, it may or may not be worth optimizing this for better speed.
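The code in this answer reads its input from a file named data.txt. To experiment without that file, one option is to read the question's ten rows from an inline string, as in the sketch below. The names txt and dat_inline are assumptions made for illustration. Because the header line has one field fewer than the data lines, read.table treats the first column as row names, which is why colClasses has five entries; the 'character' entry keeps the leading zeros in Sample.

```r
# The question's sample, pasted as text (hypothetical variable name).
txt <- "Assay Sample Genotype Data
1 CCT6-002 1486 A 1
2 CCT6-002 1486 G 0
3 CCT6-002 1997 G 0
4 CCT6-002 1997 NA NA
5 CCT6-002 0050 G 0
6 CCT6-002 0050 G 0
7 CCT6-015 0082 G 0
8 CCT6-015 0082 T 1
9 CCT6-015 0121 G 0
10 CCT6-015 0121 NA NA"

# colClasses order: row names, Assay, Sample, Genotype, Data.
# NA lets read.table choose the type; 'character' protects "0050".
dat_inline <- read.table(text = txt,
                         colClasses = c(NA, NA, "character", NA, NA))

str(dat_inline)
```

Without the 'character' entry, Sample would be read as a number and "0050" would become 50.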

    require(plyr)

    # Read data
    data = read.table('data.txt', colClasses=c(NA, NA, 'character', NA, NA))

    # Function to pick set
    pickSet <- function(x) {
      if(all(is.na(x$Data))) {
        set = 1
      } else if(length(unique(x$Data)) == 1) {
        set = 2
      } else if(!any(is.na(x$Data))) {
        set = 3
      } else {
        set = 4
      }
      data.frame(Set=set)
    }

    # Identify Set for each combo of Assay and Sample
    sets = ddply(data, c('Assay', 'Sample'), pickSet)

    # Merge set info back with data
    data = join(data, sets)

    # Reformat to list
    sets.list = lapply(1:4, function(x) data[data$Set==x,-5])

    > sets.list
    [[1]]
    [1] Assay    Sample   Genotype Data
    <0 rows> (or 0-length row.names)

    [[2]]
         Assay Sample Genotype Data
    5 CCT6-002   0050        G    0
    6 CCT6-002   0050        G    0

    [[3]]
         Assay Sample Genotype Data
    1 CCT6-002   1486        A    1
    2 CCT6-002   1486        G    0
    7 CCT6-015   0082        G    0
    8 CCT6-015   0082        T    1

    [[4]]
          Assay Sample Genotype Data
    3  CCT6-002   1997        G    0
    4  CCT6-002   1997     <NA>   NA
    9  CCT6-015   0121        G    0
    10 CCT6-015   0121     <NA>   NA
