C++ solution
It is not too difficult to write some C++ code for this:
#include <fstream>
#include <R.h>
#include <Rdefines.h>

extern "C" {

  // [[Rcpp::export]]
  SEXP dump_n_lines(SEXP rin, SEXP rout, SEXP rn) {
    // no checks on types and size
    std::ifstream strin(CHAR(STRING_ELT(rin, 0)));
    std::ofstream strout(CHAR(STRING_ELT(rout, 0)));
    int N = INTEGER(rn)[0];

    // copy character by character until N newlines have been written;
    // get(c) fails at end of file, so no stray EOF character is written
    int n = 0;
    char c;
    while (n < N && strin.get(c)) {
      if (c == '\n') ++n;
      strout.put(c);
    }

    strin.close();
    strout.close();
    return R_NilValue;
  }
}
After saving this as yourfile.cpp, you can compile and load it with
Rcpp::sourceCpp('yourfile.cpp')
From within RStudio you do not need to install anything. From a plain R console you will have to install Rcpp first. On Windows you will probably also need to install Rtools.
More efficient R code
Reading large blocks of lines instead of one line at a time will also speed up your code:
dump_n_lines2 <- function(infile, outfile, num_lines, block_size = 1E6) {
  incon <- file(infile, "r")
  outcon <- file(outfile, "w")
  remain <- num_lines
  while (remain > 0) {
    size <- min(remain, block_size)
    lines <- readLines(incon, n = size)
    writeLines(lines, outcon)
    # stop at end of file
    if (length(lines) < size) break
    remain <- remain - size
  }
  close(incon)
  close(outcon)
}
Benchmark
lines <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean commodo imperdiet nunc, vel ultricies felis tincidunt sit amet. Aliquam id nulla eu mi luctus vestibulum ac at leo. Integer ultrices, mi sit amet laoreet dignissim, orci ligula laoreet diam, id elementum lorem enim in metus. Quisque orci neque, vulputate ultrices ornare ac, interdum nec nunc. Suspendisse iaculis varius dapibus. Donec eget placerat est, ac iaculis ipsum. Pellentesque rhoncus maximus ipsum in hendrerit. Donec finibus posuere libero, vitae semper neque faucibus at. Proin sagittis lacus ut augue sagittis pulvinar. Nulla fermentum interdum orci, sed imperdiet nibh. Aliquam tincidunt turpis sit amet elementum porttitor. Aliquam lectus dui, dapibus ut consectetur id, mollis quis magna. Donec dapibus ac magna id bibendum."
lines <- rep(lines, 1E6)
writeLines(lines, con = "big.txt")

infile <- "big.txt"
outfile <- "small.txt"
num_lines <- 1E6L

library(microbenchmark)
microbenchmark(
  solution0(infile, outfile, num_lines),
  dump_n_lines2(infile, outfile, num_lines),
  dump_n_lines(infile, outfile, num_lines)
)
This results in the following timings (solution0 is the OP's original solution):
Unit: seconds
                                      expr       min        lq      mean    median        uq       max neval cld
     solution0(infile, outfile, num_lines) 11.523184 12.394079 12.635808 12.600581 12.904857 13.792251   100   c
 dump_n_lines2(infile, outfile, num_lines)  6.745558  7.666935  7.926873  7.849393  8.297805  9.178277   100  b
  dump_n_lines(infile, outfile, num_lines)  1.852281  2.411066  2.776543  2.844098  2.965970  4.081520   100 a
The C++ solution can probably be sped up further by reading large blocks of data at a time. However, this makes the code considerably more complex. Unless this is something I had to do on a regular basis, I would probably stick with the pure R solution.
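To give an idea of the extra complexity, here is a minimal sketch of what a block-wise version could look like in plain C++, without the R glue. The function name copy_first_lines and the default block_size of 1 MiB are assumptions made for illustration; they are not part of the code benchmarked above, and this version has not been timed.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): copy the first n_lines lines of
// in_path to out_path, reading block_size bytes at a time.
// Returns the number of newline-terminated lines copied, or -1 if a file
// could not be opened or block_size is zero.
long copy_first_lines(const std::string& in_path, const std::string& out_path,
                      long n_lines, std::size_t block_size = 1 << 20) {
  std::ifstream in(in_path, std::ios::binary);
  std::ofstream out(out_path, std::ios::binary);
  if (!in || !out || block_size == 0) return -1;

  std::vector<char> buf(block_size);
  long copied = 0;

  while (copied < n_lines) {
    in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
    const std::streamsize got = in.gcount();
    if (got <= 0) break;  // end of file or read error

    // Scan the block for newlines until enough lines have been seen.
    const char* begin = buf.data();
    const char* end = begin + got;
    const char* pos = begin;
    while (pos != end && copied < n_lines) {
      pos = std::find(pos, end, '\n');
      if (pos != end) {
        ++copied;
        ++pos;  // keep the newline itself in the output
      }
    }

    // pos now points just past the last byte that belongs in the output.
    out.write(begin, pos - begin);
  }
  return copied;
}
```

Like the character-by-character version, this also copies a trailing partial line when the input ends before n_lines newlines have been seen. To call it from R, it could be wrapped in the same extern "C" function as above, passing CHAR(STRING_ELT(rin, 0)), CHAR(STRING_ELT(rout, 0)) and INTEGER(rn)[0] as arguments.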
Note: when your data is tabular, you can use my LaF package to read arbitrary rows and columns from your data set without having to read all the data into memory.