Fast text file reading in C++

I am currently writing a program in C++ that involves reading large text files. Each of them has ~400,000 lines and, in extreme cases, 4,000 or more characters per line. Just for testing, I read one of the files using ifstream and the implementation offered by cplusplus.com. It took about 60 seconds, which is far too long. Now I was wondering: is there a straightforward way to improve the read speed?

edit: The code I'm using is more or less like this:

    string tmpString;
    ifstream txtFile(path);
    if (txtFile.is_open())
    {
        while (txtFile.good())
        {
            m_numLines++;
            getline(txtFile, tmpString);
        }
        txtFile.close();
    }

edit 2: The file I'm reading is only 82 MB. I mainly mentioned that lines can reach 4,000 characters because I thought that might matter for buffering.

edit 3: Thanks everyone for your answers, but it seems there is not much room for improvement given my problem. I need to use readline, since I want to count the number of lines. Reading the ifstream as binary did not make reading any faster either. I will try to parallelize it as much as possible; that should work at least.

edit 4: So apparently there are some things I can do after all. Thank you so much for putting so much time into this, I really appreciate it! =)

+43
c++ performance io ifstream
Jul 29 '13 at 13:12
6 answers

Update: be sure to check the (surprising) updates below the original answer.




Memory-mapping files has served me well 1:

    #include <boost/iostreams/device/mapped_file.hpp> // for mmap
    #include <algorithm>  // for std::find
    #include <iostream>   // for std::cout
    #include <cstring>    // for memchr
    #include <cstdint>    // for uintmax_t

    int main()
    {
        boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
        auto f = mmap.const_data();
        auto l = f + mmap.size();

        uintmax_t m_numLines = 0;
        while (f && f != l)
            if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
                m_numLines++, f++;

        std::cout << "m_numLines = " << m_numLines << "\n";
    }

It should be pretty fast.

Update

In case it helps you test this approach, here is a variation using mmap directly instead of Boost: see it live on Coliru

    #include <algorithm>
    #include <iostream>
    #include <cstring>
    #include <cstdint>  // for uintmax_t
    #include <cstdio>   // for perror
    #include <cstdlib>  // for exit

    // for mmap:
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    const char* map_file(const char* fname, size_t& length);

    int main()
    {
        size_t length;
        auto f = map_file("test.cpp", length);
        auto l = f + length;

        uintmax_t m_numLines = 0;
        while (f && f != l)
            if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
                m_numLines++, f++;

        std::cout << "m_numLines = " << m_numLines << "\n";
    }

    void handle_error(const char* msg)
    {
        perror(msg);
        exit(255);
    }

    const char* map_file(const char* fname, size_t& length)
    {
        int fd = open(fname, O_RDONLY);
        if (fd == -1)
            handle_error("open");

        // obtain file size
        struct stat sb;
        if (fstat(fd, &sb) == -1)
            handle_error("fstat");

        length = sb.st_size;

        const char* addr = static_cast<const char*>(mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0u));
        if (addr == MAP_FAILED)
            handle_error("mmap");

        // TODO close fd at some point in time, call munmap(...)
        return addr;
    }



Update

The last bit of performance I could squeeze out of this I found by looking at the source of GNU coreutils wc. To my surprise, the following (greatly simplified) code adapted from wc runs in roughly 84% of the time taken by the memory-mapped file approach above:

    static uintmax_t wc(char const *fname)
    {
        static const auto BUFFER_SIZE = 16*1024;
        int fd = open(fname, O_RDONLY);
        if (fd == -1)
            handle_error("open");

        /* Advise the kernel of our access pattern. */
        posix_fadvise(fd, 0, 0, 1);  // FDADVICE_SEQUENTIAL

        char buf[BUFFER_SIZE + 1];
        uintmax_t lines = 0;

        while (size_t bytes_read = read(fd, buf, BUFFER_SIZE))
        {
            if (bytes_read == (size_t)-1)
                handle_error("read failed");
            if (!bytes_read)
                break;

            for (char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
                ++lines;
        }

        return lines;
    }



1 see, for example, the benchmark here: How to parse space-separated floats in C++ quickly?

+57
Jul 29 '13 at 13:17

4,000 * 400,000 = 1.6 GB; if your hard drive is not an SSD, you are probably getting about 100 MB/s of sequential read. That is 16 seconds in I/O alone.

Since you do not show the specific code you are using, or say how you need to parse these files (do you need to read them line by line? does the system have a lot of RAM? could you read the whole file into a large memory buffer and then parse it?), there is little that can be suggested to speed up the process.

Memory-mapped files will not improve performance when reading a file sequentially. Perhaps manually scanning large chunks for newlines, rather than using getline, would provide an improvement.
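
For illustration only (this is not code from this answer), a minimal sketch of that chunk-based idea: count newlines by reading the file in large binary blocks. The 1 MiB buffer size and the file name are arbitrary assumptions.

    #include <algorithm>
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <vector>

    int main() {
        std::ifstream in("input.txt", std::ios::binary);  // file name assumed
        std::vector<char> buf(1 << 20);                   // 1 MiB chunks (arbitrary)
        std::uintmax_t lines = 0;

        while (in) {
            in.read(buf.data(), buf.size());              // read one large chunk
            const std::streamsize got = in.gcount();      // bytes actually read
            lines += std::count(buf.data(), buf.data() + got, '\n');
        }
        std::cout << "lines = " << lines << "\n";
    }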

EDIT: After doing some learning (thanks @sehe), this is the memory-mapped solution I would likely use.

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <errno.h>

    int main() {
        char* fName = "big.txt";
        struct stat sb;
        long cntr = 0;
        int fd, lineLen;
        char *data;
        char *line;

        // map the file
        fd = open(fName, O_RDONLY);
        fstat(fd, &sb);
        //// int pageSize;
        //// pageSize = getpagesize();
        //// data = mmap((caddr_t)0, pageSize, PROT_READ, MAP_PRIVATE, fd, pageSize);
        data = mmap((caddr_t)0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        line = data;

        // get lines
        while (cntr < sb.st_size) {
            lineLen = 0;
            line = data;

            // find the next line
            while (*data != '\n' && cntr < sb.st_size) {
                data++;
                cntr++;
                lineLen++;
            }

            // step over the newline itself so the outer loop makes progress
            if (cntr < sb.st_size) {
                data++;
                cntr++;
            }

            /***** PROCESS LINE *****/
            // ... processLine(line, lineLen);
        }

        return 0;
    }
+8
Jul 29 '13 at 13:20

Neil Kirk, unfortunately I can't reply to your comment (not enough reputation), but I did a performance test on ifstream and stringstream, and the performance reading a text file line by line is exactly the same.

    std::stringstream stream;
    std::string line;
    while (std::getline(stream, line)) {
    }

It takes 1426 ms on a 106 MB file.

    std::ifstream stream;
    std::string line;
    while (stream.good()) {
        getline(stream, line);
    }

It takes 1433 ms on the same file.

The following code is faster:

    const int MAX_LENGTH = 524288;
    char* line = new char[MAX_LENGTH];
    while (iStream.getline(line, MAX_LENGTH) && strlen(line) > 0) {
    }

It takes 884 ms on the same file. This is a bit awkward, as you need to set a maximum buffer size (i.e., the maximum length of each line in the input file).

+3
Dec 15 '17 at 9:51

Do you need to read all files at once? (e.g. at the beginning of your application)

If you do, consider parallelizing the operation.

In any case, consider using binary streams, or unbuffered reads of blocks of data.
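
Purely as a hedged illustration of both suggestions (not code from this answer), here is a rough sketch that combines them: each thread does binary block reads over its own byte range of the file and counts newlines. The thread count, block size, and file name are arbitrary assumptions, and this only pays off if the storage is fast enough to serve several readers.

    #include <algorithm>
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        const char* path = "input.txt";                       // file name assumed

        // Determine the file size so it can be split into one range per thread.
        std::ifstream probe(path, std::ios::binary | std::ios::ate);
        if (!probe) return 1;
        const std::uintmax_t size = probe.tellg();

        const unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::uintmax_t> counts(n, 0);
        std::vector<std::thread> workers;

        for (unsigned i = 0; i < n; ++i) {
            workers.emplace_back([=, &counts] {
                std::uintmax_t begin = size * i / n;           // this thread's byte range
                const std::uintmax_t end = size * (i + 1) / n; // chunk borders are arbitrary,
                std::ifstream in(path, std::ios::binary);      // fine for counting '\n' only
                in.seekg(begin);
                std::vector<char> buf(1 << 20);                // 1 MiB blocks (arbitrary)
                while (begin < end && in) {
                    const std::streamsize want = static_cast<std::streamsize>(
                        std::min<std::uintmax_t>(buf.size(), end - begin));
                    in.read(buf.data(), want);
                    const std::streamsize got = in.gcount();
                    counts[i] += std::count(buf.data(), buf.data() + got, '\n');
                    begin += got;
                }
            });
        }
        for (auto& t : workers) t.join();

        std::uintmax_t total = 0;
        for (auto c : counts) total += c;
        std::cout << "lines = " << total << "\n";
    }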

+2
Jul 29 '13 at 13:31

Use random file access, or use binary mode. For sequential reading this matters, but it still depends on what you are reading.

+1
Jul 29 '13 at 13:22

As someone with a bit of background in competitive programming, I can tell you: at least for simple things like integer parsing, the main cost in C is locking the file streams (which is done by default for multithreading). Use the unlocked_stdio versions instead (fgetc_unlocked(), fread_unlocked()). For C++, the common lore is to use std::ios::sync_with_stdio(false), but I don't know whether it is as fast as unlocked_stdio.
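
For context, a minimal sketch of that sync_with_stdio(false) idiom applied to line counting (an illustration, not a benchmark; whether it matches unlocked_stdio speeds is exactly the open question above):

    #include <cstdint>
    #include <iostream>
    #include <string>

    int main() {
        // Turn off synchronisation with the C stdio buffers and untie cin from cout.
        std::ios::sync_with_stdio(false);
        std::cin.tie(nullptr);

        std::string line;
        std::uintmax_t lines = 0;
        while (std::getline(std::cin, line))
            ++lines;
        std::cout << lines << "\n";
    }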

For reference, here is my standard integer-parsing code. It is much faster than scanf, as I said mainly because it does not lock the stream. For me it was as fast as the best hand-rolled mmap or custom-buffered versions I had used previously, without the insane maintenance debt.

    int readint(void)
    {
        int n, c;
        n = getchar_unlocked() - '0';
        while ((c = getchar_unlocked()) > ' ')
            n = 10*n + c - '0';
        return n;
    }

(Note: this approach only works if there is exactly one non-digit character between any two integers.)

And of course, avoid memory allocation if possible ...

+1
May 12 '17 at 1:35 a.m.


