Improving C++ file reading line by line?

I am parsing a 500 GB log file and my C++ version takes 3.5 minutes and my Go version takes 1.2 minutes.

I use C++ streams to read the file line by line for parsing.

    #include <fstream>
    #include <string>
    #include <iostream>

    int main(int argc, char** argv)
    {
        int linecount = 0;
        std::string line;
        std::ifstream infile(argv[1]);
        if (infile) {
            while (std::getline(infile, line)) {
                linecount++;
            }
            std::cout << linecount << ": " << line << '\n';
        }
        infile.close();
        return 0;
    }

First, why is this code so slow? Second, how can I make it faster?

+7
c++ performance file-io
3 answers

C++ standard library iostreams are notoriously slow, and this is true of all the different standard library implementations. Why? Because the standard imposes many requirements on the implementation that inhibit better performance. This part of the standard library was designed about 20 years ago and is simply not competitive in high-performance benchmarks.
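One implementation-dependent tweak before you abandon iostreams entirely: give the stream a larger internal buffer with pubsetbuf(). Whether this helps at all depends on your standard library, so treat the following as a sketch to benchmark, not a guaranteed win:

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    int main(int, char** argv)
    {
        std::vector<char> buf(1 << 20);  // 1 MB stream buffer
        std::ifstream infile;
        // pubsetbuf() must be called before open() to have any effect
        infile.rdbuf()->pubsetbuf(buf.data(), buf.size());
        infile.open(argv[1]);

        long linecount = 0;
        std::string line;
        while (std::getline(infile, line)) {
            linecount++;
        }
        std::cout << linecount << '\n';
        return 0;
    }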

How can you avoid it entirely? Use other libraries built for high-performance asynchronous I/O, such as Boost.Asio, or the native functions provided by your OS.
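To illustrate the OS-native route: on a POSIX system you could mmap() the whole file and scan it in memory. This is only a sketch of the idea; it assumes a 64-bit Linux/POSIX system (mandatory for mapping a 500 GB file) and does minimal error handling:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <algorithm>
    #include <iostream>

    int main(int, char** argv)
    {
        int fd = ::open(argv[1], O_RDONLY);
        if (fd < 0) return 1;

        struct stat st;
        if (::fstat(fd, &st) != 0) return 1;

        // map the whole file read-only; the kernel pages it in as we scan
        void* p = ::mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;

        const char* data = static_cast<const char*>(p);
        size_t newlines = std::count(data, data + st.st_size, '\n');

        ::munmap(p, st.st_size);
        ::close(fd);
        std::cout << "newlines: " << newlines << '\n';
        return 0;
    }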

If you want to stay within the standard, the std::basic_istream::read() function can satisfy your performance requirements. But in that case you have to do your own buffering and line counting. Here's how that can be done:

    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <vector>

    int main(int, char** argv)
    {
        // start at 1 so a final line without a trailing '\n' is still counted
        int linecount = 1;
        std::vector<char> buffer(1000000);  // 1 MB read buffer
        std::ifstream infile(argv[1]);
        while (infile) {
            infile.read(buffer.data(), buffer.size());
            // gcount() is the number of bytes the last read() actually got
            linecount += std::count(buffer.begin(),
                                    buffer.begin() + infile.gcount(),
                                    '\n');
        }
        std::cout << "linecount: " << linecount << '\n';
        return 0;
    }

Let me know if it's faster!

+12

Building on @Ralph Tandetzky's answer, here is a version that goes down to the low-level C I/O functions, assuming a Linux platform with a file system that provides good direct I/O support (but remaining single-threaded):

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdlib>
    #include <iostream>

    #define BUFSIZE ( 1024UL * 1024UL )

    int main(int argc, char** argv)
    {
        // use direct I/O - the page cache only slows this down
        int fd = ::open(argv[1], O_RDONLY | O_DIRECT);

        // direct I/O needs page-aligned memory
        char* buffer = (char*) ::valloc(BUFSIZE);

        size_t newlines = 0UL;

        // avoid any conditional checks in the loop - we have to
        // check the return value from read() anyway, so use that
        // to break the loop explicitly
        for (;;) {
            ssize_t bytes_read = ::read(fd, buffer, BUFSIZE);
            if (bytes_read <= (ssize_t) 0L) {
                break;
            }

            // I'm guessing here that computing a boolean-style
            // result and adding it without an if statement
            // is faster - might be wrong. Try benchmarking
            // both ways to be sure.
            for (size_t ii = 0; ii < (size_t) bytes_read; ii++) {
                newlines += (buffer[ii] == '\n');
            }
        }

        ::free(buffer);
        ::close(fd);
        std::cout << "newlines: " << newlines << std::endl;
        return 0;
    }

If you really need to go even faster, use multiple threads to read and count lines, so that the next chunk of data is read while newlines are being counted in the previous one. But unless you are running on really fast hardware designed for high performance, this is overkill.
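Here is a rough sketch of that idea, overlapping read() with counting by ping-ponging between two buffers and pushing the counting onto a second thread with std::async. This is my own illustration and is not benchmarked; the part worth getting right is that the reader never writes into a buffer whose count is still in flight:

    #include <fcntl.h>
    #include <unistd.h>
    #include <algorithm>
    #include <future>
    #include <iostream>
    #include <vector>

    int main(int, char** argv)
    {
        int fd = ::open(argv[1], O_RDONLY);
        if (fd < 0) return 1;

        const size_t kBufSize = 1UL << 20;
        std::vector<char> bufs[2] = { std::vector<char>(kBufSize),
                                      std::vector<char>(kBufSize) };
        std::future<size_t> pending;  // count of the chunk being processed
        size_t newlines = 0;
        int cur = 0;

        for (;;) {
            // read the next chunk while the previous one is still being counted
            ssize_t n = ::read(fd, bufs[cur].data(), kBufSize);
            if (pending.valid()) {
                newlines += pending.get();  // collect the previous chunk's count
            }
            if (n <= 0) {
                break;
            }
            // count this chunk on another thread while the next read() runs
            pending = std::async(std::launch::async, [&bufs, cur, n] {
                return static_cast<size_t>(
                    std::count(bufs[cur].data(), bufs[cur].data() + n, '\n'));
            });
            cur ^= 1;  // switch buffers so the reader never overwrites in-flight data
        }

        ::close(fd);
        std::cout << "newlines: " << newlines << '\n';
        return 0;
    }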

+4

The I/O routines from good old C should be significantly faster than the clumsy C++ streams. If you know a reasonable upper bound on the length of a line, you can use fgets with a buffer like char line[1<<20];. Since you are going to parse the data anyway, you could also simply use fscanf directly on your file.
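A minimal sketch of the fgets approach. It assumes no line is longer than the 1 MB buffer; fgets() stops at '\n' or when the buffer fills, so an over-long line would be counted more than once here:

    #include <cstdio>

    int main(int, char** argv)
    {
        std::FILE* f = std::fopen(argv[1], "r");
        if (!f) return 1;

        static char line[1 << 20];  // static: 1 MB is too big for the stack
        long linecount = 0;

        // fgets() reads up to and including the next '\n' (or until the
        // buffer fills), so each successful call is normally one line
        while (std::fgets(line, sizeof line, f)) {
            linecount++;
        }

        std::fclose(f);
        std::printf("%ld\n", linecount);
        return 0;
    }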

Note that if your file is physically stored on a hard drive, the drive's read speed will become the bottleneck in any case, as indicated here. So to minimize overall processing time you don't need the fastest possible parsing; plain fscanf may well be enough.

0
