R: Is it possible to parallelize / speed up the reading of a 20-million-line CSV into R?

Once a CSV is loaded via read.csv , it is fairly simple to use multicore , segue , etc. to process the data in parallel. Reading it in, however, is currently quite time-consuming.

I understand that it would be better to use MySQL, etc.

Suppose an AWS 8xl cluster compute instance is running R 2.13.

The specs are as follows:

 Cluster Compute Eight Extra Large specifications:
  • 88 EC2 Compute Units (2 x eight-core Intel Xeon)
  • 60.5 GB of memory
  • 3370 GB of instance storage
  • 64-bit platform
  • I/O performance: Very High (10 Gigabit Ethernet)

Any thoughts / ideas are much appreciated.

+8
parallel-processing r csv bigdata
3 answers

Parallelization may not be necessary if you use fread in data.table .

 library(data.table)
 dt <- fread("myFile.csv")

The comments on that question illustrate its power. Here is also an example from my own experience:

 d1 <- fread('Tr1PointData_ByTime_new.csv')
 Read 1048575 rows and 5 (of 5) columns from 0.043 GB file in 00:00:09

I was able to read 1.05 million rows in under 10 seconds!
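As a self-contained sketch of the approach above (the file name and column layout here are made up for illustration), you can generate a small CSV and read it back with fread, which infers the separator and column types on its own:

```r
library(data.table)

# Hypothetical example: write a 100,000-row CSV to a temp file, then fread it.
csv <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:1e5, value = runif(1e5)), csv)

dt <- fread(csv)   # fread detects the separator, header, and column types
nrow(dt)           # 100000
```

On larger files, fread also accepts select = to read only the columns you need, which cuts both time and memory.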

+5

What you can do is use scan . Two of its input arguments are interesting: n and skip . You simply open two or more connections to the file and use skip and n to select the part you want to read from it. There are some caveats:

  • At some point, disk I/O may become the bottleneck.
  • Hopefully, scan does not complain when multiple connections are opened to the same file.

But you can give it a try and see if it improves your speed.
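A minimal sketch of this idea, assuming a headerless two-column numeric CSV whose total line count is known in advance (the file and chunk sizes here are invented): each worker uses scan with skip and nlines to read only its slice, and the slices are stitched back together afterwards.

```r
library(parallel)

# Stand-in data: a headerless CSV with 10,000 rows and 2 numeric columns.
csv <- tempfile(fileext = ".csv")
write.table(matrix(rnorm(2e4), ncol = 2), csv,
            sep = ",", row.names = FALSE, col.names = FALSE)

n_lines  <- 1e4          # total line count, assumed known up front
n_chunks <- 4
bounds   <- floor(seq(0, n_lines, length.out = n_chunks + 1))

# Each worker skips to its slice and reads only its share of lines.
# mclapply forks, so this sketch is for Unix-alikes; on Windows use parLapply.
chunks <- mclapply(seq_len(n_chunks), function(i) {
  scan(csv, what = list(x = numeric(), y = numeric()), sep = ",",
       skip = bounds[i], nlines = bounds[i + 1] - bounds[i],
       quiet = TRUE)
}, mc.cores = n_chunks)

# Stitch the slices back into one data frame.
d <- do.call(rbind, lapply(chunks, as.data.frame))
nrow(d)   # 10000
```

Note that scan still has to stream past the skipped lines to find each slice, so the gain (if any) comes from overlapping parsing with I/O, not from skipping work.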

+2

Flash or conventional HD storage? If the latter, then unless you know where the file sits on the disks and how it is split across them, it is very hard to speed things up, because several simultaneous reads will not be faster than one streaming read. The bottleneck is the disk, not the CPU. Without starting at the file-storage level, there is no way to parallelize this.

If it is flash storage, then a solution like Paul Hiemstra's might help, since good flash storage can have random-read performance close to sequential. Try it... and if it doesn't help, find out why.

Also, a fast storage interface is of no use if the drives cannot saturate it. Have you run performance tests on the disks to see how fast they really are?
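A quick-and-dirty sequential-read check can be done from R itself. The sketch below writes a stand-in file and times reading it back; note that the OS page cache makes this wildly optimistic, so a real benchmark should use a file larger than RAM (or a tool like dd or fio):

```r
# Rough sequential-read throughput sketch. The 50 MB stand-in file is
# illustrative only; a meaningful test needs a file larger than RAM.
f <- tempfile()
writeBin(raw(50e6), f)                      # 50 MB of zero bytes

sz <- file.info(f)$size
t  <- system.time(invisible(readBin(f, what = raw(), n = sz)))

# Guard against a cached read finishing in under the timer resolution.
mbps <- (sz / 1e6) / max(t[["elapsed"]], 0.001)
```

If mbps is far below what the interface promises, the drives (not R) are the thing to fix.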

+2
