In R, how do I read a CSV file line by line and have the contents recognized as the correct data types?

I want to read in a CSV file whose first line contains the variable names and whose remaining lines contain the values. Some of the variables are numeric, some are textual, and some are even empty.

file = "path/file.csv" f = file(file,'r') varnames = strsplit(readLines(f,1),",")[[1]] data = strsplit(readLines(f,1),",")[[1]] 

Now that data holds the values of one row, how do I get them recognized with the correct data types, the same way read.csv would have done?

I need to read the data row by row (or n rows at a time), since the entire data set is too large to read into R at once.
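
To illustrate the gap (a hedged sketch, not a solution I'm committed to): strsplit gives me character vectors only, and the best I have come up with is converting each value separately with base R's type.convert, which is not the column-wise type detection that read.csv does.

    # Hedged sketch of the problem: everything comes back as character.
    # type.convert() guesses a type per value here, not per column like read.csv.
    row <- strsplit(readLines(f, 1), ",")[[1]]
    typed <- lapply(row, type.convert, as.is = TRUE)
    str(typed)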

+5
7 answers

You can use the chunked or disk.frame packages if you don't mind doing a bit of work to rewrite your data.

Both have options that allow you to read data in parts.
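
For example, a minimal sketch with chunked (assuming its read_chunkwise/write_chunkwise functions together with dplyr verbs; the column name and file paths are hypothetical):

    # Hedged sketch: stream the CSV through dplyr verbs chunk by chunk,
    # never holding the whole file in memory. Column name "value" is made up.
    library(chunked)
    library(dplyr)

    read_chunkwise("path/file.csv", chunk_size = 1e5) %>%
      filter(!is.na(value)) %>%
      write_chunkwise("path/file_filtered.csv")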

0

Based on DWin's comment, you can try something like this:

    read.clump <- function(file, lines, clump){
      if(clump > 1){
        header <- read.csv(file, nrows=1, header=FALSE)
        p = read.csv(file, skip = lines*(clump-1),
                     #p = read.csv(file, skip = (lines*(clump-1))+1 if not a textConnection
                     nrows = lines, header=FALSE)
        names(p) = header
      } else {
        p = read.csv(file, skip = lines*(clump-1), nrows = lines)
      }
      return(p)
    }

You should probably also add some error handling and input checking.

Then, with some test data:

 x = "letter1, letter2 a, b c, d e, f g, h i, j k, l" >read.clump(textConnection(x), lines = 2, clump = 1) letter1 letter2 1 ab 2 cd > read.clump(textConnection(x), lines = 2, clump = 2) letter1 letter2 1 ef 2 gh > read.clump(textConnection(x), lines = 3, clump = 1) letter1 letter2 1 ab 2 cd 3 ef > read.clump(textConnection(x), lines = 3, clump = 2) letter1 letter2 1 gh 2 ij 3 kl 

Now you just need to *apply over the clumps.
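
A hedged sketch of that last step (the clump count is hypothetical; in practice you would derive it from the file's total line count):

    # Apply read.clump to every clump in turn and summarise each one.
    lines_per_clump <- 1000
    n_clumps <- 25   # hypothetical, e.g. ceiling(total_lines / lines_per_clump)

    results <- lapply(seq_len(n_clumps), function(i) {
      clump <- read.clump("path/file.csv", lines = lines_per_clump, clump = i)
      colMeans(Filter(is.numeric, clump), na.rm = TRUE)
    })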

+11

An alternative strategy that has been discussed here before for dealing with very large (say, > 1e7-ish cells) CSV files:

  • Read the CSV file into an SQLite database.
  • Import the data from the database using read.csv.sql from the sqldf package.

The main advantages of this are that it is usually faster and that you can easily filter the content to include only the columns or rows you need.

See "How to import a CSV into SQLite using RSQLite?" for more information.
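
A hedged sketch of that approach, assuming the sqldf package: read.csv.sql imports the CSV into a temporary SQLite database and returns only what the SQL selects (the table is referred to as file; the column names below are hypothetical).

    # Hedged sketch with sqldf: only the selected columns/rows ever reach R.
    library(sqldf)

    subset_df <- read.csv.sql("path/file.csv",
                              sql = "select col1, col3 from file where col1 > 100")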

+6

Just for fun (I'm waiting for a long calculation to finish here :-)), here is a version that lets you use any of the read.* functions, and which also fixes a tiny error in Greg's code:

    read.clump <- function(file, lines, clump, readFunc=read.csv,
        skip=(lines*(clump-1))+ifelse((header) & (clump>1) & (!inherits(file, "connection")), 1, 0),
        nrows=lines, header=TRUE, ...){
      if(clump > 1){
        colnms <- NULL
        if(header) {
          colnms <- unlist(readFunc(file, nrows=1, header=FALSE))
          print(colnms)
        }
        p = readFunc(file, skip = skip,
                     nrows = nrows, header=FALSE, ...)
        if(!is.null(colnms)) {
          colnames(p) = colnms
        }
      } else {
        p = readFunc(file, skip = skip,
                     nrows = nrows, header=header)
      }
      return(p)
    }

Now you can pass whichever read function is appropriate as the readFunc parameter, and pass additional parameters through to it as well. Metaprogramming is fun.
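
For instance, a hedged usage sketch (the tab-separated file path is hypothetical):

    # Read the second clump of a tab-separated file by swapping in read.delim,
    # passing stringsAsFactors = FALSE through the ... argument.
    second_clump <- read.clump("path/file.tsv", lines = 1000, clump = 2,
                               readFunc = read.delim, stringsAsFactors = FALSE)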

+4

As an aside: if you really have data that huge, there are (besides the SQLite solution) several packages that help you deal with it without resorting to tricks like the ones described in these answers.

There are ff, and the bigmemory package with its friends biganalytics, bigtabulate, biglm, and so on. For an overview see, for example, the CRAN task view on high-performance computing.
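
For instance, a hedged sketch with bigmemory (note that a big.matrix stores a single atomic type, so this suits all-numeric data; the backing-file names are hypothetical):

    # File-backed matrix: rows are pulled from disk on demand, not held in RAM.
    library(bigmemory)

    bm <- read.big.matrix("path/file.csv", header = TRUE, type = "double",
                          backingfile = "file.bin", descriptorfile = "file.desc")
    bm[1:5, ]   # access a few rows without loading the whole file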

+3

I would try the LaF package:

Methods for fast access to large ASCII files ... It is assumed that the files are too large to fit into memory ... Methods are provided to access and process files blockwise. Furthermore, an opened file can be accessed as one would an ordinary data.frame ...

I managed to get the example below to work, and it seems to have the performance you would expect from a streaming implementation. However, I would recommend that you also run your own timing tests.

    library('LaF')
    model <- detect_dm_csv('data.csv', header = TRUE, nrows = 600)  # read only 600 rows for type detection
    mylaf <- laf_open(model)
    print(mylaf[1000, ])  # print the 1000th row
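
Block-wise processing also looks possible; a hedged sketch assuming LaF's next_block (the column name is hypothetical):

    # Walk through the file in blocks of 1e5 rows and accumulate a running sum.
    library('LaF')
    model <- detect_dm_csv('data.csv', header = TRUE, nrows = 600)
    mylaf <- laf_open(model)

    total <- 0
    repeat {
      block <- next_block(mylaf, nrows = 1e5)   # next chunk as a data.frame
      if (nrow(block) == 0) break               # nothing left to read
      total <- total + sum(block$amount, na.rm = TRUE)
    }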
+1

I think disk.frame's csv_to_disk.frame with the in_chunk_size argument set would be useful for this use case, e.g.:

    library(disk.frame)
    csv_to_disk.frame("/path/to/file.csv", in_chunk_size = 1e7)
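
Once converted, the resulting disk.frame can be processed chunk by chunk; a hedged follow-up sketch assuming disk.frame's dplyr integration (the column name is hypothetical):

    # dplyr verbs are applied per chunk; collect() brings the result into memory.
    library(disk.frame)
    library(dplyr)
    setup_disk.frame()   # start the background workers disk.frame uses

    df <- csv_to_disk.frame("/path/to/file.csv", in_chunk_size = 1e7)
    small_result <- df %>%
      filter(!is.na(value)) %>%
      collect()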
0
