`setattr` on `levels` keeps unwanted duplicates (R data.table)

Key problem: when I use setattr to change level names, unwanted duplicate levels are kept.

I am cleaning some data in which several factor levels that should all be the same appear as two or more different levels. (This is mainly due to typos and file problems.) I have 153K factors, and about 5% need to be fixed.

Example

In the following example, the vector has three levels, two of which should be collapsed into one.

    incorrect <- factor(c("AOB", "QTX", "A_B"))  # this is how the data were entered
    correct   <- factor(c("AOB", "QTX", "AOB"))  # this is how the data *should* be

    > incorrect
    [1] AOB QTX A_B
    Levels: A_B AOB QTX   <~~ Note that "A_B" should be "AOB"

    > correct
    [1] AOB QTX AOB
    Levels: AOB QTX

The vector is a column of a data.table.
Everything works fine when I use the `levels<-` function to change the level names.
However, when I use `setattr`, the unwanted duplicates are retained.

    mydt1 <- data.table(id=1:3, incorrect, key="id")
    mydt2 <- data.table(id=1:3, incorrect, key="id")

    # assigning levels, duplicate levels are dropped
    levels(mydt1$incorrect) <- gsub("_", "O", levels(mydt1$incorrect))

    # using setattr, duplicate levels are not dropped
    setattr(mydt2$incorrect, "levels", gsub("_", "O", levels(mydt2$incorrect)))

    # RESULTS
    # Assigning Levels           # Using `setattr`
    > mydt1$incorrect            > mydt2$incorrect
    [1] AOB QTX AOB              [1] AOB QTX AOB
    Levels: AOB QTX              Levels: AOB AOB QTX   <~~~ Notice the duplicate level

Any thoughts on why this happens, and/or any options for changing this behavior (e.g. something like `..., droplevels=TRUE`)? Thanks.

r duplicate-removal data.table
1 answer

`setattr` is a low-level, brute-force way to change attributes by reference. It does not know that the "levels" attribute is special. `levels<-` has more functionality inside it, but I suspect you found that `levels(DT$col) <- newlevels` copies the whole of DT (base `<-`), so you looked at `setattr` for speed.

I wouldn't call the result wrong. It is still a valid factor; it just happens to have duplicated levels.

To drop the duplicated levels, I think (untested):

    mydt2[, incorrect := factor(incorrect)]

should do it. You could go faster than that by finding which levels you changed, changing the integers to point to the first duplicate, and then removing the duplicates from the levels. The `factor()` call basically starts from scratch (i.e., it coerces the whole factor to character and rebuilds it).
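The faster route described above could look roughly like the sketch below. It is only an illustration, not a tested recipe for the 153K case: the table `DT`, the column `f`, and the helper names `fixed`, `new_levels`, `remap`, and `codes` are all made up for this example. It works out the mapping from old to new levels before touching the column, so a factor with duplicated levels is never created, and it still allocates one new integer vector the length of the column.

```r
library(data.table)

# Hypothetical table mirroring the question's example
DT <- data.table(id = 1:3, f = factor(c("AOB", "QTX", "A_B")))

# Corrected level names; some of these now collide
fixed <- gsub("_", "O", levels(DT$f))

# Keep the first occurrence of each corrected name
new_levels <- unique(fixed)

# For every old level, the position of its corrected name in new_levels
remap <- match(fixed, new_levels)

# Re-point the integer codes at the de-duplicated levels
codes <- remap[as.integer(DT$f)]

# Rebuild the factor from codes + levels, with no character conversion of the data
DT[, f := structure(codes, levels = new_levels, class = "factor")]

levels(DT$f)
```

Only the level vector goes through `gsub`, `unique`, and `match`; the column itself is touched by a single integer lookup, which is why this should be cheaper than `factor()` on a long column.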
