How to read a large dataset in R

Possible duplicate:
Fast reading of very large tables as data in R

Hi,

When I try to read a large dataset in R, the console shows the following errors:

> data<-read.csv("UserDailyStats.csv", sep=",", header=T, na.strings="-", stringsAsFactors=FALSE)
> data = data[complete.cases(data),]
> dataset<-data.frame(user_id=as.character(data[,1]),
                      event_date= as.character(data[,2]),
                      day_of_week=as.factor(data[,3]),
                      distinct_events_a_count=as.numeric(as.character(data[,4])),
                      total_events_a_count=as.numeric(as.character(data[,5])),
                      events_a_duration=as.numeric(as.character(data[,6])),
                      distinct_events_b_count=as.numeric(as.character(data[,7])),
                      total_events_b=as.numeric(as.character(data[,8])),
                      events_b_duration= as.numeric(as.character(data[,9])))
Error: cannot allocate vector of size 94.3 Mb
In addition: Warning messages:
1: In data.frame(user_msisdn = as.character(data[, 1]), calls_date = as.character(data[, :
  NAs introduced by coercion
2: In data.frame(user_msisdn = as.character(data[, 1]), calls_date = as.character(data[, :
  NAs introduced by coercion
3: In class(value) <- "data.frame" :
  Reached total allocation of 3583Mb: see help(memory.size)
4: In class(value) <- "data.frame" :
  Reached total allocation of 3583Mb: see help(memory.size)

Does anyone know how to read large datasets? UserDailyStats.csv is approximately 2 GB in size.

+4
3 answers

Sure:

  • Get a bigger computer, in particular one with more RAM.
  • Run a 64-bit OS, so that R can actually use the extra RAM from the first point.
  • Read only the columns you need.
  • Read fewer rows.
  • Read the data from a binary format rather than re-parsing a 2 GB text file each time (which is very inefficient).
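
The "read only the columns you need" and "read fewer rows" suggestions can be sketched with read.csv itself. This is a minimal illustration, not the asker's real data: the temporary file, its five columns, and the choice of which columns to keep are all made up so the snippet runs on its own.

```r
# Build a small stand-in CSV so the sketch is runnable; in practice this
# would be the real file (e.g. "UserDailyStats.csv"). All column names
# below are hypothetical.
path <- tempfile(fileext = ".csv")
demo <- data.frame(
  user_id     = c("u1", "u2", "u3", "u4"),
  event_date  = c("2012-01-01", "2012-01-01", "2012-01-02", "2012-01-02"),
  day_of_week = c("Sun", "Sun", "Mon", "Mon"),
  total_a     = c(3, 5, 2, 7),
  total_b     = c(1, 0, 4, 2)
)
write.csv(demo, path, row.names = FALSE)

# colClasses = "NULL" (the string, not the NULL object) tells read.csv to
# skip that column entirely; nrows caps the number of data rows read.
subset_data <- read.csv(
  path,
  header     = TRUE,
  colClasses = c("character", "NULL", "NULL", "numeric", "NULL"),
  nrows      = 3
)

print(dim(subset_data))    # rows and columns actually loaded
print(names(subset_data))  # only the columns that were not skipped
```

Skipped columns are never stored in the resulting data frame, so this reduces the memory needed for the result, although the file is still scanned as text.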

There is also a manual covering this on the R project site.

+13

You can try specifying the column types in the read.csv call using colClasses.

 data<-read.csv("UserDailyStats.csv", sep=",", header=T, na.strings="-", stringsAsFactors=FALSE, colClasses=c("character","character","factor",rep("numeric",6))) 

With a dataset of this size, though, this may still be problematic, and there will not be much memory left for any analysis you might want to do. Adding RAM and moving to 64-bit computing would give you more flexibility.
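
One rough way to judge whether the full file is likely to fit is to read a sample of rows with the intended colClasses and extrapolate from its size. The sketch below assumes a made-up four-column file and a known total row count; the result is only an approximation, since character columns and per-object overhead do not scale exactly linearly.

```r
# Stand-in file with hypothetical columns; replace with the real CSV.
path <- tempfile(fileext = ".csv")
set.seed(1)
n_total <- 5000
demo <- data.frame(
  id  = sprintf("u%05d", seq_len(n_total)),
  day = sample(c("Mon", "Tue", "Wed"), n_total, replace = TRUE),
  x   = runif(n_total),
  y   = runif(n_total)
)
write.csv(demo, path, row.names = FALSE)

# Read only the first rows, with explicit column types.
sample_rows <- 1000
sample_data <- read.csv(
  path,
  colClasses = c("character", "factor", "numeric", "numeric"),
  nrows      = sample_rows
)

# Approximate in-memory bytes per row, scaled to the full row count.
# n_total is an assumption here; for a real file it has to be known or
# estimated separately.
bytes_per_row <- as.numeric(object.size(sample_data)) / sample_rows
estimated_mb  <- bytes_per_row * n_total / 1024^2

cat(sprintf("Estimated size of full data frame: about %.2f MB\n", estimated_mb))
```

If the estimate is close to the memory limit reported in the error message, intermediate copies made during analysis will probably push R over it.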

+1

If this is the console output, then the data is being read successfully; the problems come from the conversions.

If you are working interactively, save your data right after read.csv with save(data, file="data.RData"), close R, start a new instance, load the data with load("data.RData"), and see whether the conversion works then.
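
The save and load round trip might look like the following sketch. The temporary file paths and the two-column data frame are placeholders, and rm() plus gc() only stand in for actually closing and restarting R.

```r
# Placeholder paths and data; in the real case the CSV is the large file
# and the .RData file would live in the working directory.
csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")
write.csv(
  data.frame(user_id = c("u1", "u2"), total_a = c(3, 5)),
  csv_path,
  row.names = FALSE
)

# Step 1: parse the text file once and store the result in binary form.
data <- read.csv(csv_path, stringsAsFactors = FALSE)
save(data, file = rdata_path)

# Step 2: stand-in for "close R and start a new instance": drop the object
# and ask R to release the memory.
rm(data)
invisible(gc())

# Step 3: load() restores the object under its original name and returns
# the names of everything it restored.
loaded_names <- load(rdata_path)
print(loaded_names)
str(data)
```

Starting from the binary file means the new session holds only the data frame itself, without the temporary allocations made while parsing the CSV.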

Judging from these messages, though, you have conversion problems, so you should look into those.
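
To track down which values trigger "NAs introduced by coercion", one option is to compare a column before and after as.numeric. The vector below is invented for illustration; with real data it would be one of the columns read as character.

```r
# Hypothetical raw column as it might arrive from read.csv.
raw <- c("12", "7.5", "-", "n/a", "1,200", NA, "3")

# Convert, silencing the coercion warning we are about to investigate.
converted <- suppressWarnings(as.numeric(raw))

# A value failed to convert if it was present before and is NA afterwards.
failed <- is.na(converted) & !is.na(raw)

print(which(failed))        # positions of the offending entries
print(unique(raw[failed]))  # the distinct strings that could not be parsed
```

Strings that show up here can then be added to na.strings in the read.csv call or cleaned before conversion.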

+1