An easy way to subset data frame rows that works even when no rows match the exclusion criteria

One common task when manipulating data in R is to subset a data frame by removing rows that match certain criteria. However, the simple way to do this in R seems logically inconsistent and even dangerous for the inexperienced (like me).

Suppose we have a data frame, and we want to exclude the rows for treatment "G1":

 Treatment=c("G1","G1","G1","G1","G1","G1","G2","G2","G2","G2","G2",
             "G2","G3","G3","G3","G3","G3","G3")
 Vals=c(runif(6),runif(6)+0.9,runif(6)-0.3)
 data=data.frame(Treatment)
 data=cbind(data, Vals)

As expected, the code below removes the rows that match the criterion in its first line:

 to_del=which(data$Treatment=="G1")
 new_data=data[-to_del,]
 new_data

However, contrary to what one would expect, if the "which" command does not find ANY matching row, this approach removes all the rows instead of leaving them alone:

 to_del=which(data$Treatment=="G4")
 new_data=data[-to_del,]
 new_data

The above code produces a data frame with no rows left, which makes no sense (i.e., because R found no rows matching my criteria for deletion, it deleted all rows). My workaround works, but I suspect there is an easier way to do this without all these conditional statements:

 ### WORKAROUND
 to_del=which(data$Treatment=="G4")  # no G4 treatment in this particular data frame
 if (length(to_del)>0){
   new_data=data[-to_del,]
 }else{
   new_data=data
 }
 new_data

Does anyone have an easy way to do this that works even if none of the rows match the specified criteria?

+7
4 answers

You have run into a common pitfall when using which. Use != instead.

 new_data <- data[data$Treatment!="G4",] 

The problem is that which returns integer(0) if all elements are FALSE. This would still be a problem even if which returned 0, because subsetting by zero also returns integer(0):

 R> # subsetting by zero (positive or negative)
 R> (1:3)[0]  # same as (1:3)[-0]
 integer(0)

You will also run into problems if you subset by NA:

 R> # subsetting by NA
 R> (1:3)[NA]
 [1] NA NA NA
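The NA caveat also applies to the != filter itself when the column being compared contains NA, because the comparison then yields NA for those rows. The sketch below is only an illustration: the small data frame df and its values are made up, and the %in% variant is shown as one possible NA-safe alternative, not as part of the original answer.

```r
# Hypothetical data frame with a missing treatment label (not from the question)
df <- data.frame(Treatment = c("G1", "G2", NA), Vals = c(1, 2, 3))

# != returns NA where Treatment is NA, and an NA index yields an all-NA row
via_neq <- df[df$Treatment != "G1", ]

# %in% never returns NA, so the row with the missing label is kept intact
via_in <- df[!(df$Treatment %in% "G1"), ]

via_neq$Vals  # second value is NA: the original row was replaced by NAs
via_in$Vals   # original values of the two kept rows
```

Which behaviour is appropriate depends on whether rows with a missing label should be kept or dropped, so it is worth deciding that explicitly.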
+6

Why not use subset?

 subset(data, ! rownames(data) %in% to_del ) 

(In any case, your data[-to_del, ] examples were implicitly matching against the row names, which equal the row positions as long as the row names are the defaults.) Of course, once this works, you can go back to using just "[":

 data[ ! rownames(data) %in% to_del , ] 
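One thing to watch with the row-name approach: which returns positions, and those coincide with the row names only while the row names are still the default 1..n. The following sketch, using a made-up data frame df that is not part of the question, shows the two diverging after an earlier subset, with a position-based variant (seq_len(nrow(...))) as a possible alternative.

```r
# Hypothetical data frame (values are made up for illustration)
df <- data.frame(Treatment = c("G1", "G2", "G3", "G2"), Vals = 1:4)

# After a first subset the row names are "2", "3", "4", not 1..3
df2 <- df[df$Treatment != "G1", ]

to_del <- which(df2$Treatment == "G3")  # position 2 within df2

# Matching on row names drops the row *named* "2", which is a G2 row
by_name <- df2[!rownames(df2) %in% to_del, ]

# Matching on positions drops the intended G3 row
by_pos <- df2[!seq_len(nrow(df2)) %in% to_del, ]

# With no match, to_del is integer(0) and %in% keeps every row
none <- which(df2$Treatment == "G4")
all_kept <- df2[!seq_len(nrow(df2)) %in% none, ]
```

Either form handles the empty case correctly, since %in% against an empty vector is FALSE for every row.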
+3

I like to use data.table for subsetting because it is more intuitive, shorter, and faster with large data sets.

 library(data.table)
 data.dt<-as.data.table(data)
 setkey(data.dt, Treatment)
 data.dt[!"G1",]
 ##     Treatment        Vals
 ##  1:        G2  0.90264622
 ##  2:        G2  1.47842130
 ##  3:        G2  1.52494735
 ##  4:        G2  1.46373958
 ##  5:        G2  1.12850658
 ##  6:        G2  1.46705561
 ##  7:        G3  0.58451869
 ##  8:        G3 -0.20231228
 ##  9:        G3  0.52519475
 ## 10:        G3  0.62956475
 ## 11:        G3 -0.06655426
 ## 12:        G3  0.56814703
 data.dt[!"G4",]
 ##    Treatment        Vals
 ## 1         G1  0.93411692
 ## 2         G1  0.60153972
 ## 3         G1  0.28147464
 ## 4         G1  0.97264924
 ## 5         G1  0.50804831
 ## 6         G1  0.48273876
 ## 7         G2  0.90264622
 ## 8         G2  1.47842130
 ## 9         G2  1.52494735
 ## 10        G2  1.46373958
 ## 11        G2  1.12850658
 ## 12        G2  1.46705561
 ## 13        G3  0.58451869
 ## 14        G3 -0.20231228
 ## 15        G3  0.52519475
 ## 16        G3  0.62956475
 ## 17        G3 -0.06655426
 ## 18        G3  0.56814703

Note that if you subset on columns that were not set as the key, you need to use the column name in the subset (for example, data.dt[Vals<0,]).

I think the creators of data.table may be working on a way to delete rows directly from the original table, instead of copying all non-deleted rows to a new table and then deleting the original table. That would be a big help when you run into memory limitations.

+3

The problem is that you are not selecting which rows to DELETE; you are selecting which rows to KEEP. As you have learned, you can often interchange these concepts, but sometimes problems arise.

In particular, when you use which, you are asking R "which elements of this vector are TRUE?" When it does not find any, it indicates this by returning integer(0).

integer(0) is not an actual number but an empty vector, so negating integer(0) still gives integer(0).
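A quick check at the console illustrates this. The vector x below is a made-up example, not taken from the question; the point is only that a negated empty index is still an empty index, which selects nothing.

```r
x <- c(10, 20, 30)        # hypothetical example vector
idx <- which(x > 100)     # nothing matches, so idx is integer(0)

length(idx)               # zero elements
identical(-idx, idx)      # negating an empty vector leaves it empty
x[-idx]                   # an empty index selects no elements at all
```

The same logic explains why data[-to_del, ] returns zero rows when to_del is empty: R sees "select these (zero) rows", not "drop nothing".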

However, there is no need to use which if you are only going to use its result for filtering.

Instead, take the expression that you passed to which and use it directly as a filter in data[..]. Recall that you can index with a logical vector as well as with an integer vector.
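As a minimal sketch of that idea, assuming a small data frame df shaped like the one in the question (the values here are made up), the logical vector says for each row whether to keep it, so the "no match" case simply keeps everything:

```r
# Hypothetical data frame in the shape of the question's example
df <- data.frame(Treatment = rep(c("G1", "G2", "G3"), each = 2),
                 Vals = c(0.1, 0.2, 1.1, 1.2, -0.1, 0.3))

keep <- df$Treatment != "G1"          # one TRUE/FALSE per row
no_g1 <- df[keep, ]                   # the two G1 rows are dropped

no_g4 <- df[df$Treatment != "G4", ]   # no G4 rows exist: every row is kept

nrow(no_g1)
nrow(no_g4)
```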

+2
