Fastest way to copy the first X lines from one file to another in R? (Cross platform)

I can’t load the file into RAM (suppose the user may need the first billion lines of a file with ten billion records)

Here is my solution, but I think there should be a faster way?

thanks

    # specified by the user
    infile <- "/some/big/file.txt"
    outfile <- "/some/smaller/file.txt"
    num_lines <- 1000

    # my attempt
    incon <- file( infile , "r")
    outcon <- file( outfile , "w")

    for ( i in seq( num_lines ) ){
        line <- readLines( incon , 1 )
        writeLines( line , outcon )
    }

    close( incon )
    close( outcon )
+7
io r large-files
8 answers

C++ Solution

It is not so difficult to write C++ code for this:

    #include <fstream>

    #include <R.h>
    #include <Rdefines.h>

    extern "C" {

    // [[Rcpp::export]]
    SEXP dump_n_lines(SEXP rin, SEXP rout, SEXP rn) {
      // no checks on types and size
      std::ifstream strin(CHAR(STRING_ELT(rin, 0)));
      std::ofstream strout(CHAR(STRING_ELT(rout, 0)));
      int N = INTEGER(rn)[0];
      int n = 0;
      while (strin && n < N) {
        char c = strin.get();
        if (c == '\n') ++n;
        strout.put(c);
      }
      strin.close();
      strout.close();
      return R_NilValue;
    }

    }

When saved as yourfile.cpp you can do

 Rcpp::sourceCpp('yourfile.cpp') 

From RStudio you do not need to load anything manually; in a plain R console you have to load Rcpp yourself. On Windows you may need to install Rtools.
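
After that, a minimal call from R could look like this (the paths are the placeholders from the question; the third argument must be an integer because the C++ code reads INTEGER(rn)[0]):

    # sketch of calling the compiled function; paths are placeholders
    dump_n_lines("/some/big/file.txt", "/some/smaller/file.txt", 1000L)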

More efficient R code

Your code will also speed up when it reads large blocks of lines instead of single lines:

    dump_n_lines2 <- function(infile, outfile, num_lines, block_size = 1E6) {
      incon  <- file( infile , "r")
      outcon <- file( outfile , "w")
      remain <- num_lines

      while (remain > 0) {
        size  <- min(remain, block_size)
        lines <- readLines(incon , n = size)
        writeLines(lines , outcon)
        # check for eof:
        if (length(lines) < size) break
        remain <- remain - size
      }
      close( incon )
      close( outcon )
    }

Benchmark

 lines <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean commodo imperdiet nunc, vel ultricies felis tincidunt sit amet. Aliquam id nulla eu mi luctus vestibulum ac at leo. Integer ultrices, mi sit amet laoreet dignissim, orci ligula laoreet diam, id elementum lorem enim in metus. Quisque orci neque, vulputate ultrices ornare ac, interdum nec nunc. Suspendisse iaculis varius dapibus. Donec eget placerat est, ac iaculis ipsum. Pellentesque rhoncus maximus ipsum in hendrerit. Donec finibus posuere libero, vitae semper neque faucibus at. Proin sagittis lacus ut augue sagittis pulvinar. Nulla fermentum interdum orci, sed imperdiet nibh. Aliquam tincidunt turpis sit amet elementum porttitor. Aliquam lectus dui, dapibus ut consectetur id, mollis quis magna. Donec dapibus ac magna id bibendum." lines <- rep(lines, 1E6) writeLines(lines, con = "big.txt") infile <- "big.txt" outfile <- "small.txt" num_lines <- 1E6L library(microbenchmark) microbenchmark( solution0(infile, outfile, num_lines), dump_n_lines2(infile, outfile, num_lines), dump_n_lines(infile, outfile, num_lines) ) 

Results in (solution0 is the OP's original solution):

    Unit: seconds
                                          expr       min        lq      mean    median        uq       max neval cld
        solution0(infile, outfile, num_lines) 11.523184 12.394079 12.635808 12.600581 12.904857 13.792251   100   c
    dump_n_lines2(infile, outfile, num_lines)  6.745558  7.666935  7.926873  7.849393  8.297805  9.178277   100  b
     dump_n_lines(infile, outfile, num_lines)  1.852281  2.411066  2.776543  2.844098  2.965970  4.081520   100 a

The C++ solution can probably be sped up further by reading in large blocks of data at a time. However, this would make the code much more complicated. Unless this were something I had to do on a regular basis, I would probably stick with the pure R solution.

Note: when your data is tabular, you can use my LaF package to read arbitrary rows and columns from your data set without having to read all of the data into memory.
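
A hedged sketch of what that could look like for a comma-separated file (the column names and types below are made up for illustration):

    library(LaF)

    # hypothetical columns: an integer id and a numeric value
    laf <- laf_open_csv("/some/big/file.csv",
                        column_types = c("integer", "double"),
                        column_names = c("id", "value"))
    first_rows <- laf[1:1000, ]   # only these rows are read into memory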

+6

You can use ff::read.table.ffdf for this. It stores data on the hard drive and does not use RAM.

 library(ff) infile <- read.table.ffdf(file = "/some/big/file.txt") 

Essentially, you can use the above function in the same way as base::read.table with the difference that the resulting object will be saved on your hard drive.

You can also use the nrows argument to load a specific number of lines. The documentation is here if you want to read it. Once you have read the file, you can subset the specific lines you need and even convert them to data.frames if they fit in RAM.

There is also a write.table.ffdf function that lets you write an ffdf object (such as the result of read.table.ffdf), which makes the process quite easy.
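
Putting the two together for the question's task, a rough sketch (assuming the file is tabular and whitespace-separated, as read.table would expect):

    library(ff)

    first_rows <- read.table.ffdf(file = "/some/big/file.txt", nrows = 1000)
    write.table.ffdf(first_rows, file = "/some/smaller/file.txt")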


For an example of how to use read.table.ffdf (or read.delim.ffdf , which is almost the same), see the following:

    # writing a file in my current directory
    # note that there is no standard number of columns
    sink(file = 'test.txt')
    cat('foo , foo, foo\n')
    cat('foo, foo\n')
    cat('bar bar , bar\n')
    sink()

    # read it with read.delim.ffdf or read.table.ffdf
    read.delim.ffdf(file = 'test.txt', sep = '\n', header = F)

Output:

    ffdf (all open) dim=c(3,1), dimorder=c(1,2) row.names=NULL
    ffdf virtual mapping
       PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol PhysicalIsOpen
    V1           V1      integer       integer FALSE           FALSE            FALSE                 1                1               1           TRUE
    ffdf data
                  V1
    1 foo , foo, foo
    2       foo, foo
    3  bar bar , bar

If you are working with a txt file, this is a general solution, since each line ends with a \n character.

+7

I like pipes for this, as we can use other tools. And conveniently, the (really great) connections interface in R supports them:

    ## scratch file
    filename <- "foo.txt"

    ## create a file, no header or rownames for simplicity
    write.table(1:50, file=filename, col.names=FALSE, row.names=FALSE)

    ## sed command: print from first address to second, here 4 to 7
    ## the -n suppresses output unless selected
    cmd <- paste0("sed -n -e '4,7p' ", filename)
    ##print(cmd)                  # to debug if needed

    ## we use the cmd inside pipe() as if it was file access so
    ## all other options to read.csv (or read.table) are available too
    val <- read.csv(pipe(cmd), header=FALSE, col.names="selectedRows")
    print(val, row.names=FALSE)

    ## clean up
    unlink(filename)

If we run this, we get lines from four to seven, as expected:

    edd@max:/tmp$ r piper.R
     selectedRows
                4
                5
                6
                7
    edd@max:/tmp$

Note that our use of sed made no assumptions about the file structure other than assuming a

  • standard "ascii" text file, to be read in text mode
  • standard CR/LF line endings as record separators

If you had binary files with different record separators, different solutions could be offered.

Also note that you control the command passed to the pipe() function. So if you need lines 1000004 to 1000007, the usage is exactly the same: you just specify the first and last line (of each segment; there can be several). And readLines() can be used equally well instead of read.csv().
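
For the original task of copying the first num_lines lines, a minimal sketch along these lines (paths are the placeholders from the question; fine as long as those lines fit in memory, otherwise redirect inside the command instead):

    infile    <- "/some/big/file.txt"
    outfile   <- "/some/smaller/file.txt"
    num_lines <- 1000

    cmd <- sprintf("sed -n -e '1,%dp' %s", num_lines, infile)
    writeLines(readLines(pipe(cmd)), outfile)
    ## or, without going through R's memory at all:
    ## system(sprintf("sed -n -e '1,%dp' %s > %s", num_lines, infile, outfile))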

Finally, sed is available everywhere and, if memory serves, also comes with Rtools. The same basic filtering can also be done with Perl or a number of other tools.

+6

I usually speed up such loops by reading and writing in chunks of, say, 1000 lines. With num_lines being a multiple of 1000, the code becomes:

    # specified by the user
    infile <- "/some/big/file.txt"
    outfile <- "/some/smaller/file.txt"
    num_lines <- 1000000

    # my attempt
    incon <- file( infile, "r")
    outcon <- file( outfile, "w")

    step1 = 1000
    nsteps = ceiling(num_lines/step1)

    for ( i in 1:nsteps ){
        line <- readLines( incon, step1 )
        writeLines( line, outcon )
    }

    close( incon )
    close( outcon )
+3

The operating system level is the best place for large file manipulations. This is fast, and it comes with a benchmark (which seems important, given that the poster asked for a faster method):

    # create test file in shell
    echo "hello world" > file.txt
    for i in {1..29}; do cat file.txt file.txt > file2.txt && mv file2.txt file.txt; done
    wc -l file.txt
    # about a billion rows

It takes a few seconds for about a billion lines. Change 29 to 32 to get about ten billion.

Then in R, copying ten million lines out of the billion (one hundred million would be too slow to benchmark against the poster's solution):

    # in R, copy first ten million rows of the billion
    system.time(
      system("head -n 10000000 file.txt > out.txt")
    )

    # poster's solution
    system.time({
      infile <- "file.txt"
      outfile <- "out.txt"
      num_lines <- 1e7

      incon <- file( infile , "r")
      outcon <- file( outfile , "w")
      for ( i in seq( num_lines )) {
        line <- readLines( incon , 1 )
        writeLines( line , outcon )
      }
      close( incon )
      close( outcon )
    })

And the results on a middling MacBook Pro a couple of years old:

    Rscript head.R
       user  system elapsed
      1.349   0.164   1.581
       user  system elapsed
    620.665   3.614 628.260

It would be interesting to know how fast the other solutions are.

+3

The "right" or best answer for this would be to use a language that handles file connections much more easily. For example, while perl is an ugly language in many ways, this is where it shines. Python can also do this very well, in a more verbose fashion.


However, you explicitly stated that you want something in R. First, I will assume that this thing may not be a CSV or other delimited flat file.

Use the readr library. Within this library, use read_lines(). Something like this (first, get the number of lines in the whole file, using something like the method shown here):

    library(readr)

    # specified by the user
    infile <- "/some/big/file.txt"
    outfile <- "/some/smaller/file.txt"
    num_lines <- 1000

    # readr attempt
    # num_lines_tot is found via the method shown in the link above
    num_loops <- ceiling(num_lines_tot / num_lines)

    incon <- file( infile , "r")
    outcon <- file( outfile , "w")

    for ( i in seq(num_loops) ){
      lines <- read_lines(incon, skip = (i - 1) * num_lines, n_max = num_lines)
      writeLines( lines , outcon )
    }

    close( incon )
    close( outcon )

A few notes:

  • There is no nice and handy line-writing function in the readr library that is as generic as you might like. (There is, for example, write_delim, but you did not specify a separator.)
  • Any information found in previous incarnations of the outfile will be lost. I'm not sure whether you wanted to open the outfile in append mode ("a"), but I suspect that would be useful.
  • I have found that when working with large files like this, I often want to filter the data while bringing it in. Making a straight copy seems odd. Maybe you want to do more? (See the sketch after this list.)
  • If you have a delimited file, you will want to look at read_csv or read_delim in the readr package.
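
A minimal sketch of the last two points (append mode plus a filter), assuming infile, outfile and num_lines as above; the pattern is only an illustration:

    library(readr)

    # open the output in append mode so earlier contents are kept
    outcon <- file(outfile, "a")
    lines  <- read_lines(infile, n_max = num_lines)
    keep   <- grepl("pattern of interest", lines)   # illustrative filter only
    writeLines(lines[keep], outcon)
    close(outcon)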
+2

Try the head utility. It should be available on all operating systems that R supports (on Windows this assumes that you have Rtools installed and the Rtools bin directory is on your path). For example, to copy the first 100 lines from in.dat to out.dat:

 shell("head -n 100 in.dat > out.dat") 
+2

Try using

    line <- read.csv(infile, nrows = 1000)
    write.table(line, file = outfile, append = TRUE)
-2
