Is a copy made when a function returns a data.table?

I am updating a set of functions that previously accepted only data.frame objects so that they also work with data.table arguments.

I decided to implement this with R's S3 method dispatch, so that old code using data.frame still works with the updated functions. One of my functions takes a data.frame as input, modifies it, and returns the modified data.frame. I have added a data.table implementation as well. For instance:

    # The functions
    foo <- function(d) {
      UseMethod("foo")
    }

    foo.data.frame <- function(d) {
      # <Do Something>
      return(d)
    }

    foo.data.table <- function(d) {
      # <Do Something>
      return(d)
    }

I know that data.table modifies objects in place without copying, and I implemented foo.data.table with this in mind. However, I still return the data.table object at the end of the function because I want my old scripts to work with the new data.table objects. Will this make a copy of the data.table? How can I check? According to the documentation, you have to be very explicit to create a copy of a data.table, but I am not sure that applies here.

Why I want to return something even though data.table does not require it:

My old scripts look like this:

    someData <- read.table(...)
    ...
    someData <- foo(someData)

I want the scripts to work with data.table just by changing the lines that read the data. In other words, I want a script to work after changing only someData <- read.table(...) to someData <- fread(...).
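To make the goal concrete, here is a minimal sketch of the same script body running against both readers. The method bodies, the column names a and b, and the inline CSV text are assumptions made up for illustration; they are not the real functions or data.

```r
library(data.table)

foo <- function(d) UseMethod("foo")

# Hypothetical method bodies: both add a column `total`
foo.data.frame <- function(d) {
  d$total <- d$a + d$b        # ordinary copy-on-modify semantics
  d
}
foo.data.table <- function(d) {
  d[, total := a + b]         # adds the column by reference
  d
}

csv <- "a,b\n1,10\n2,20\n3,30\n"   # stand-in for a real input file

# Old script: data.frame input
dfData <- read.table(text = csv, header = TRUE, sep = ",")
dfData <- foo(dfData)

# New script: only the reading line differs
dtData <- fread(text = csv)
dtData <- foo(dtData)
```

Because a data.table has class c("data.table", "data.frame"), foo dispatches to foo.data.table first, and the unchanged assignment someData <- foo(someData) keeps working.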

2 answers

Thanks to Arun for his reply in the comments. I will use his example from the comments to answer the question.

You can check whether copies are made by using the tracemem function to track the object in R. The description in its help file, ?tracemem, says:

This function marks an object so that a message is printed whenever the internal code copies the object. It is a major cause of hard-to-predict memory use in R.

For instance:

    library(data.table)

    # Using a data.frame
    df <- data.frame(x=1:5, y=6:10)
    tracemem(df)
    ## [1] "<0x32618220>"
    df$y[2L] <- 11L
    ## tracemem[0x32618220 -> 0x32661a98]:
    ## tracemem[0x32661a98 -> 0x32661b08]: $<-.data.frame $<-
    ## tracemem[0x32661b08 -> 0x32661268]: $<-.data.frame $<-
    df
    ##   x  y
    ## 1 1  6
    ## 2 2 11
    ## 3 3  8
    ## 4 4  9
    ## 5 5 10

    # Using a data.table
    dt <- data.table(x=1:5, y=6:10)
    tracemem(dt)
    ## [1] "<0x5fdab40>"
    set(dt, i=2L, j=2L, value=11L)
    # No memory output!
    address(dt)  # Verify the address in memory is the same
    ## [1] "0x5fdab40"
    dt
    ##    x  y
    ## 1: 1  6
    ## 2: 2 11
    ## 3: 3  8
    ## 4: 4  9
    ## 5: 5 10

It appears that the data.frame object is copied more than once when a single element is changed (tracemem reports three copies in the output above), while the data.table is modified in place without any copy!

For my question, I can simply call tracemem on the object d (data.table or data.frame) before passing it to foo to check whether any copies are made.
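Applied to the functions from the question, that check might look like the sketch below. The method bodies and the column z are hypothetical, since the real ones are elided in the question, and tracemem is guarded because it is only available when R was built with memory profiling.

```r
library(data.table)

foo <- function(d) UseMethod("foo")

# Hypothetical method bodies: each adds a column z
foo.data.frame <- function(d) {
  d$z <- d$x * 2L
  return(d)
}
foo.data.table <- function(d) {
  d[, z := x * 2L]            # modifies d by reference
  return(d)
}

dt <- data.table(x = 1:5, y = 6:10)
before <- address(dt)         # memory address before the call

# tracemem prints a message for every copy made of the traced object
can_trace <- capabilities("profmem")
if (can_trace) tracemem(dt)

dt <- foo(dt)

if (can_trace) untracemem(dt)

# Same address before and after means the returned object is not a copy
same_object <- identical(before, address(dt))
same_object
```

If no tracemem message is printed and same_object is TRUE, returning d from foo.data.table did not copy the table. Running the same check on a data.frame input is a quick way to see the contrast.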


Not sure if this adds anything, but as a cautionary note, pay attention to the following behavior:

    library(data.table)

    foo.data.table <- function(d) {
      d[,A:=4]
      d$B <- 1
      d[,C:=1]
      return(d)
    }

    set.seed(1)
    dt <- data.table(A=rnorm(5),B=runif(5),C=rnorm(5))
    dt
    #             A         B            C
    # 1: -0.6264538 0.2059746 -0.005767173
    # 2:  0.1836433 0.1765568  2.404653389
    # 3: -0.8356286 0.6870228  0.763593461
    # 4:  1.5952808 0.3841037 -0.799009249
    # 5:  0.3295078 0.7698414 -1.147657009

    result <- foo.data.table(dt)
    dt
    #    A         B            C
    # 1: 4 0.2059746 -0.005767173
    # 2: 4 0.1765568  2.404653389
    # 3: 4 0.6870228  0.763593461
    # 4: 4 0.3841037 -0.799009249
    # 5: 4 0.7698414 -1.147657009

    result
    #    A B C
    # 1: 4 1 1
    # 2: 4 1 1
    # 3: 4 1 1
    # 4: 4 1 1
    # 5: 4 1 1

So dt is evidently passed by reference to foo.data.table(...), and the first statement, d[,A:=4], modifies it by reference, changing column A in dt.

The second statement, d$B <- 1, forces a copy of d to be created inside the function's scope (the copy is also named d). The third statement, d[,C:=1], then modifies that copy by reference (without affecting dt), and return(d) returns the copy.

If you swap the order of the second and third statements, the effect of the function call on dt is different.
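A sketch of that swapped variant, under the same setup as above. The name foo_swapped and the helper B_before are made up for illustration.

```r
library(data.table)

# Hypothetical variant with the second and third statements swapped
foo_swapped <- function(d) {
  d[, A := 4]   # by reference: changes the caller's table
  d[, C := 1]   # still by reference: also changes the caller's table
  d$B <- 1      # copies d; only the local copy gets the new B
  return(d)
}

set.seed(1)
dt <- data.table(A = rnorm(5), B = runif(5), C = rnorm(5))
B_before <- copy(dt$B)   # keep the original B values for comparison

result <- foo_swapped(dt)

# Both := statements ran before the copy, so they reached dt
all(dt$A == 4)
all(dt$C == 1)

# The $<- assignment only touched the copy returned as result
identical(dt$B, B_before)
all(result$B == 1)
```

Here both A and C in dt are overwritten, because the copy is only made once d$B <- 1 runs. If the caller's table must stay untouched, calling copy(d) at the top of the function is the explicit way to get that.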
