R - sample used in %in% modifies the data frame that is being subset

Not sure if I named the question correctly, because I do not quite understand the reason for the behavior:

 dfSet <- data.frame(ID = sample(1:15, size = 15, replace = FALSE), va1 = NA, va3 = 0, stringsAsFactors = FALSE)
 dfSet[1:10, ]$va1 <- 'o1'
 dfSet[11:15, ]$va1 <- 'o2'
 dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), ]$va3 <- 1
 print(length(unique(dfSet$ID)))

I expect the final print to show 15, but it does not. Instead, 13 or 14 appears, and dfSet is changed so that at least two rows share the same ID. It seems that this piece of code:

 dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), ]$va3 <- 1

changes the ID column, and I don't know why.

Workaround:

 temp <- sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE)
 dfSet[dfSet$ID %in% temp, ]$va3 <- 1

In this case, everything works as expected: 15 rows, each with a unique ID.

The question is, why does using sample directly inside %in% change the data frame?

+8
r sample subset
3 answers

It seems the problem is that R does something fairly complicated when you assign a value to the return value of a function. For example, something like

 a <- c(1,3)
 names(a) <- c("one", "three")

would look very strange in most languages. How can you assign a value to the return value of a function? What really happens is that there is a function called names<- . It returns a modified version of the original object, which is then used to replace the value passed to the function. So it really looks like

 .temp. <- `names<-`(a, c("one","three"))
 a <- .temp.

The variable a is always completely replaced, not just its names.
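As a rough check of that equivalence, the sketch below applies both forms to copies of the same vector and compares the results. The second vector `b` is a hypothetical name introduced only for this comparison.

```r
a <- c(1, 3)
b <- a                                  # hypothetical second copy, for comparison only

names(a) <- c("one", "three")           # usual replacement syntax
b <- `names<-`(b, c("one", "three"))    # explicit call to the replacement function

identical(a, b)                         # both forms should give the same object
```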

When you do something like

 dfSet$a<-1 

what really happens is, again,

 .temp. <- "$<-"(dfSet, a, 1)
 dfSet <- .temp.

Now things get a little more complicated when you combine subsetting with [ ] and $ in the same assignment. Look at this example

 #for subsetting
 f <- function(x,v) {print("testing"); x==v}
 x <- rep(0:1, length.out=nrow(dfSet))
 dfSet$a <- 0
 dfSet[f(x,1),]$a<-1

Please note that “testing” is printed twice. What happens is really more like

 .temp1. <- "$<-"(dfSet[f(x,1),], a, 1)
 .temp2. <- "[<-"(dfSet, f(x,1), , .temp1.)
 dfSet <- .temp2.

So f(x,1) is evaluated twice. This means that sample will also be evaluated twice.
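One way to see the double evaluation without relying on printed output is to count the calls. This is a minimal sketch on a small made-up data frame; `df`, `idx` and `calls` are hypothetical names, not part of the original example.

```r
df <- data.frame(ID = 1:6, a = 0)   # small illustrative data frame (assumption)
calls <- 0                          # hypothetical counter

idx <- function() {
  calls <<- calls + 1               # record each evaluation of the index expression
  df$ID > 3                         # deterministic index, so the result is still correct
}

df[idx(), ]$a <- 1                  # subset with [ ] and assign with $
calls                               # 2: once for the extraction, once for the write-back
```

Because the index here is deterministic, both evaluations select the same rows and nothing is corrupted. With a random index such as sample, the two evaluations can select different rows.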

The error is more obvious if you try to replace a variable that does not exist yet

 dfSet[f(x,1),]$b<-1
 # Warning message:
 # In `[<-.data.frame`(`*tmp*`, f(x, 1), , value = list(ID = c(6L, :
 #   provided 4 variables to replace 3 variables

You get a warning here because .temp1. gained a column and now has 4 columns, but when the assignment to .temp2. runs, the slice of the data frame you are trying to replace has a different size.

The IDs get replaced because the $<- operator does not return just the new column; it returns a new data.frame with the column updated to the value you assigned. So the updated rows come back with the IDs they had when the assignment was made, and this is stored in .temp1. . Then, when the [<- assignment runs, a new set of rows is selected to be swapped out. The values of all columns in those rows are replaced by the values from .temp1. . This overwrites the IDs of the replaced rows, and since they may differ, you will probably end up with two or more copies of some ID.
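The overwrite can be reproduced deterministically by writing out the two steps by hand and deliberately using different row sets for each. This is only an illustrative sketch: `df`, `first`, `second` and `tmp` are hypothetical names, and the fixed row indices stand in for two different results of sample.

```r
df <- data.frame(ID = 1:5, v = 0)   # small illustrative data frame (assumption)

first  <- c(1, 2)                   # rows chosen by the "first" evaluation of the index
second <- c(3, 4)                   # rows chosen by the "second" evaluation

# Step 1: $<- returns whole rows (including their IDs), not just the new column
tmp <- `$<-`(df[first, ], "v", 1)

# Step 2: [<- writes those whole rows over a different set of rows
df <- `[<-`(df, second, , value = tmp)

df$ID                               # IDs 1 and 2 now appear twice; 3 and 4 are gone
length(unique(df$ID))               # fewer unique IDs than rows
```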

+7

Although I'm not 100% sure, I suspect that R runs sample twice. When you subset and assign in R, for example:

 x[i:j,]$v1 <- 1 

It is evaluated as "take rows i to j of x as a temporary data frame, assign 1 to column v1 of that data frame, and then copy the temporary data frame back into rows i to j of x".

So perhaps the indexing expression (i:j) is evaluated twice (once for the extraction and once for writing back), and if it is random, the results will be written back to different rows than those originally selected.

+2

Consider this simpler example:

 x <- data.frame(a=1:10, b=10:1)
 x$b <- 5

What the second line does is

 x <- `$<-`(x, 'b', 5) 

You can see that $<- is just a function that takes three arguments: an object, a name, and a value. (Note that the backticks are necessary if you want to call $<- directly.)

The problem, I think, is that in your example x is an expression that evaluates to something different each time it is evaluated, because of the call to sample , so that should be avoided.

An alternative is to use [<- , which does not seem to have this problem:

 dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), 'va3'] <- 1 
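A quick way to check that this form evaluates the row index only once is to count the calls, as in the sketch below. The names `df`, `idx` and `calls` are hypothetical and the data frame is made up for illustration.

```r
df <- data.frame(ID = 1:6, a = 0)   # small illustrative data frame (assumption)
calls <- 0                          # hypothetical counter

idx <- function() {
  calls <<- calls + 1               # record each evaluation of the index expression
  df$ID > 3
}

df[idx(), "a"] <- 1                 # single [<- call with the column given by name
calls                               # 1: the index expression is evaluated once
```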
+1