Import and extract random samples from a large .CSV to R

I am doing some analysis in R where I need to work with some large data sets (10-20 GB, stored as .csv files, which I read with the read.csv function).

Since I also need to combine and convert large .csv files with other data frames, I do not have the processing power or memory to import the entire file.

I was wondering if anyone knows of a way to import a random percentage of the rows from a .csv file.

I have seen several examples where people import the entire file and then use a separate function to create another data frame that is a sample of the original. However, I am hoping for something less memory-intensive.

1 answer

I think there is no good R tool for reading a file randomly (maybe it could be an extension to read.table or to fread from the data.table package).

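Using Perl, you can easily complete this task. For example, to read a random 1% of the lines of your file, you can do this:

xx <- system(paste("perl -ne 'print if (rand() < .01)'", shQuote(big_file)), intern = TRUE)

Here I call it from R using system(). xx now contains roughly 1% of the lines of your file as a character vector. Each line is kept independently with probability 0.01, so the header line is usually not among them.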
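Since the sampled lines are raw text, they still have to be parsed before they can be used as a data frame. A minimal sketch of one way to do that, assuming the file has a single header line and no quoted fields containing line breaks. The helper name lines_to_df is hypothetical and not part of any package:

```r
# Hypothetical helper: parse sampled CSV lines using the file's real header.
lines_to_df <- function(sampled_lines, big_file) {
  # Read only the first line of the file to get the column names.
  header <- readLines(big_file, n = 1)
  # Drop the header if it happened to be sampled, so it is not parsed as data.
  sampled_lines <- sampled_lines[sampled_lines != header]
  # read.csv() accepts a character vector of lines through its text argument.
  read.csv(text = c(header, sampled_lines), stringsAsFactors = FALSE)
}
```

With the xx from above, this would be called as lines_to_df(xx, big_file).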

You can wrap it all in a function:

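read_partial_rand <-
  function(big_file, percent) {
    # percent is a fraction between 0 and 1 (e.g. 0.01 keeps about 1% of the lines)
    cmd <- paste0("perl -ne 'print if (rand() < ", percent, ")'")
    # shQuote() protects file paths that contain spaces
    cmd <- paste(cmd, shQuote(big_file))
    # returns the sampled lines as a character vector
    system(cmd, intern = TRUE)
  }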
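
If Perl is not available, the same idea can be sketched in base R by reading the file in chunks through a connection, so that only one chunk plus the kept lines are in memory at a time. This is an alternative to the Perl approach rather than part of it. The function name sample_csv_chunked and its chunk_size parameter are hypothetical, and the sketch assumes a single header line and no quoted fields containing line breaks:

```r
# Hypothetical helper: sample rows of a CSV in base R, one chunk at a time.
sample_csv_chunked <- function(big_file, fraction, chunk_size = 100000L) {
  con <- file(big_file, open = "r")
  on.exit(close(con))
  # Always keep the header line so the columns get their names.
  header <- readLines(con, n = 1)
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0) break
    # Keep each line independently with probability `fraction`.
    kept <- c(kept, chunk[runif(length(chunk)) < fraction])
  }
  read.csv(text = c(header, kept), stringsAsFactors = FALSE)
}
```

As with the Perl version, the number of rows returned is random and only approximately fraction times the number of rows in the file. This will likely be slower than the Perl one-liner on very large files.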