Rownames for data.table in R for model.matrix

I have a data.table DT and I want to run model.matrix on it. Each row has a row identifier, which is stored in the ID column of DT. When I run model.matrix on DT, my formula excludes the ID column. The problem is that model.matrix drops some rows because of NAs. If I set the row names of DT to the ID column before calling model.matrix, then the final model matrix has those row names, and I'm all set. Otherwise, I cannot figure out which rows were kept. I set the row names with rownames(DT) = DT$ID. However, when I try to add a new column to DT, I get a complaint about

"An invalid .internal.selfref was detected ... At an earlier point, this data.table was copied by R."

So I wonder:

  • Is there a better way to set row names for a data.table?
  • Is there a better approach to solving this problem?
1 answer

There are several issues here.

The first is that data.tables, by design, do not have rownames; instead they have keys, which are much more powerful. See this wonderful vignette.

But this is not the end of the world. model.matrix returns sensible rownames when you pass it a data.table.

For instance:

 library(data.table)
 A <- data.table(ID = 1:5, x = c(NA, 1:4), y = c(4:2, NA, 3))
 mm <- model.matrix(~ x + y, A)
 rownames(mm)
 ## [1] "2" "3" "5"

So rows 2, 3 and 5 are the ones included in the model matrix.

Now you can add this sequence as a column to A. This will be useful if you later set the key to something else (thereby losing the original order):

 A[, rowid := seq_len(nrow(A))]

You might consider making it a character column (to match the rownames of mm), but that doesn't really matter, since you can just as easily convert rownames(mm) to numeric when you need to refer to the rows.
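To map the surviving rows back to the ID column, the positions in rownames(mm) can be used as a row index into the table. The following is a minimal sketch, not the only way to do it. It reuses the toy data from above, except that ID is changed to letters (my assumption, purely for illustration) so that the identifiers differ from the row positions. The names kept and kept_ids are hypothetical.

```r
library(data.table)

# Toy data as above, but with letter IDs so they differ from row positions
A <- data.table(ID = letters[1:5], x = c(NA, 1:4), y = c(4:2, NA, 3))
mm <- model.matrix(~ x + y, A)

# rownames(mm) hold the positions of the rows that survived NA removal
kept <- as.integer(rownames(mm))

# Look up the matching identifiers by row position
kept_ids <- A[kept, ID]

# mm is a plain matrix, so assigning rownames to it is harmless
rownames(mm) <- kept_ids
```

This relies on A still being in its original row order when the lookup is done. If the key has been changed in between, the lookup would have to go through a stored column such as rowid instead.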

Regarding the warning given by data.table, read the sentence that follows the part you quoted:

Avoid key<-, names<- and attr<- which in R currently (and weirdly) can copy the entire data.table. Use the set* syntax instead to avoid copying: setkey(), setnames() and setattr()

rownames are an attribute, so rownames<- (which internally at some point uses the equivalent of attr<-) will possibly copy in the same way.

The relevant line in `row.names<-.data.frame` is

 attr(x, "row.names") <- value 

That said, data.tables do not have rownames, so there is no point in setting them.
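If row names on the model matrix are still wanted, one possible alternative (a sketch of my own, not something the answer above prescribes) is to leave the data.table alone and set row names on a data.frame copy that is used only for model.matrix. The object names DT, df and z are hypothetical, and the ID values are assumed to be unique and non-missing, which data.frame row names require.

```r
library(data.table)

DT <- data.table(ID = letters[1:5], x = c(NA, 1:4), y = c(4:2, NA, 3))

# Work on a data.frame copy; DT itself is never passed to rownames<-
df <- as.data.frame(DT)
rownames(df) <- DT$ID

# Rows with NA are dropped; the remaining rows keep their ID as row name
mm <- model.matrix(~ x + y, df)

# DT was not modified above, so := can add a column by reference as usual
DT[, z := 2 * x]
```

The cost is one extra copy of the data, which may matter for large tables. In that case, indexing by rownames(mm) on the original data.table avoids the copy.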
