getline() vs. reading the entire file and then splitting on newlines

I want to process every line of a file on the hard drive. Is it better to load the whole file and then split it on newlines (using boost), or is it better to use getline()? My question is: does getline() read a single line each time it is called (resulting in multiple accesses to the hard drive), or does it read the whole file and hand it back line by line?

+8
c++
6 answers

getline will call read() as a system call somewhere deep in the guts of the C library. Exactly how many times it is called, and how much it reads each time, depends on the design of the C library. But most likely there is no noticeable difference between reading a line at a time and reading the whole file, because at the lowest level the OS will read (at least) one disk block at a time, and most likely at least a "page" (4 KB), if not more.

Also, unless you do almost nothing with each line after you read it (for example, you are writing something like "grep" and basically just reading to find a string), it is unlikely that the overhead of reading one line at a time is where most of your time is spent.

But reading the entire file at once has a couple of clear problems:

  • You do not start processing until you have read the entire file.
  • You need enough memory to hold the entire file - what if the file is several hundred GB? Does your program simply fail?

Do not try to optimize something unless you have used profiling to prove that it is part of what makes your code slow. You will only cause more problems for yourself.

Edit: So, I wrote a program to measure this, since I find it quite interesting.

And the results are certainly interesting. To make the comparison fair, I created three large files of 1297984192 bytes each (by copying all the source files in a directory with about a dozen different source files, and then copying the resulting file several times to "multiply" it, until running the test took more than 1.5 seconds, which is how long I think you need to run things so that the timing is not too susceptible to a random "network packet came in" or other external influences stealing time from the process).

I also decided to measure the user and system time used by the process.

$ ./bigfile
Lines=24812608 Wallclock time for mmap is 1.98 (user:1.83 system: 0.14)
Lines=24812608 Wallclock time for getline is 2.07 (user:1.68 system: 0.389)
Lines=24812608 Wallclock time for readwhole is 2.52 (user:1.79 system: 0.723)
$ ./bigfile
Lines=24812608 Wallclock time for mmap is 1.96 (user:1.83 system: 0.12)
Lines=24812608 Wallclock time for getline is 2.07 (user:1.67 system: 0.392)
Lines=24812608 Wallclock time for readwhole is 2.48 (user:1.76 system: 0.707)

Here are the three different functions for reading the file (there is code for measuring time and so on, of course, but to keep this post a reasonable size I prefer not to post all of it; I also played with the order of the tests to see whether it made a difference, so the results above are not in the same order as the functions here):

// Read the whole file into a heap buffer, then count lines from a stringstream.
void func_readwhole(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());
    if (!f)
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }
    f.seekg(0, ios::end);
    streampos size = f.tellg();
    f.seekg(0, ios::beg);

    char* buffer = new char[size];
    f.read(buffer, size);
    if (f.gcount() != size)
    {
        cerr << "Read failed ...\n";
        exit(1);
    }

    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);

    int lines = 0;
    string str;
    while (getline(ss, str))
    {
        lines++;
    }

    f.close();
    cout << "Lines=" << lines << endl;
    delete [] buffer;
}

// Read the file one line at a time straight from the ifstream.
void func_getline(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());
    if (!f)
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }
    string str;
    int lines = 0;
    while (getline(f, str))
    {
        lines++;
    }
    cout << "Lines=" << lines << endl;
    f.close();
}

// Map the whole file into memory with mmap, then count lines from a stringstream.
void func_mmap(const char *name)
{
    char *buffer;
    string fullname = string("bigfile_") + name;
    int f = open(fullname.c_str(), O_RDONLY);
    off_t size = lseek(f, 0, SEEK_END);
    lseek(f, 0, SEEK_SET);
    buffer = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, f, 0);

    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);

    int lines = 0;
    string str;
    while (getline(ss, str))
    {
        lines++;
    }
    munmap(buffer, size);
    cout << "Lines=" << lines << endl;
}
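
The timing harness itself was left out of the answer above. For reference, a minimal sketch of one way such numbers could be collected on a POSIX system; this is not the author's omitted code, and run_test and the function-pointer signature are assumptions chosen to match the reader functions above.

#include <sys/time.h>
#include <sys/resource.h>
#include <iostream>

// Wall-clock time in seconds since the epoch.
static double wallclock_now()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

// Convert a struct timeval (as used inside struct rusage) to seconds.
static double to_seconds(const struct timeval &tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

// Run one of the reader functions and print wall-clock, user and system time.
static void run_test(void (*func)(const char *), const char *name)
{
    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    double start = wallclock_now();

    func(name);

    double stop = wallclock_now();
    getrusage(RUSAGE_SELF, &after);

    std::cout << "Wallclock time for " << name << " is " << (stop - start)
              << " (user:" << (to_seconds(after.ru_utime) - to_seconds(before.ru_utime))
              << " system: " << (to_seconds(after.ru_stime) - to_seconds(before.ru_stime))
              << ")" << std::endl;
}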
+5

The OS will read a whole block of data at a time (depending on how the disk is formatted, usually 4-8 KB) and do some buffering for you. Let the OS take care of that and read the data in whatever way makes sense for your program.
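
As a rough sketch of "read the data as it makes sense for your program": just read line by line and let the layers below do the buffering. If you really want larger reads you can offer the stream a bigger buffer, although whether pubsetbuf honours it is implementation-defined; the file name here is only an example.

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    // Optionally hand the stream a 1 MB buffer so the library can issue
    // fewer, larger read() calls. This must be done before opening the file
    // and is only a hint (setbuf behaviour is implementation-defined).
    std::vector<char> buf(1 << 20);
    std::ifstream f;
    f.rdbuf()->pubsetbuf(buf.data(), buf.size());
    f.open("input.txt");                     // example file name
    if (!f)
    {
        std::cerr << "could not open input.txt\n";
        return 1;
    }

    std::string line;
    long lines = 0;
    while (std::getline(f, line))
        ++lines;                             // process each line as it arrives

    std::cout << "Lines=" << lines << '\n';
}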

+2

Iostreams are buffered sensibly. The operating system's accesses to the hard drive are buffered sensibly. The hard drive itself has a sensible buffer. You will almost certainly not trigger more hard drive accesses by reading the file line by line. Or character by character, for that matter.

So there is no reason to load the entire file into a big buffer and work on that buffer, because it already is in a buffer. And there is often no reason to buffer one line at a time either. Why allocate memory to buffer something in a string that is already buffered in the ifstream? If you can, work on the stream directly; don't bother tossing everything around twice or more, from one buffer to the next. Unless doing so helps readability and/or your profiler has told you that disk access is slowing your program down significantly.
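
A minimal sketch of what "work on the stream directly" can look like, for the hypothetical task of summing whitespace-separated numbers (the file name and the task are only illustrative): nothing is copied into a buffer of your own, and the ifstream's internal buffer is the only one involved.

#include <fstream>
#include <iostream>

int main()
{
    std::ifstream f("numbers.txt");   // example input file
    if (!f)
    {
        std::cerr << "could not open numbers.txt\n";
        return 1;
    }

    // Extract values straight from the stream; the ifstream's buffer
    // (plus the OS page cache underneath it) already does the buffering.
    long long sum = 0;
    long long value;
    while (f >> value)
        sum += value;

    std::cout << "sum=" << sum << '\n';
}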

+1

If it is a small file on disk, it is probably more efficient to read the entire file and parse it line by line than to read one line at a time, which would require many disk accesses.
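
A minimal sketch of that whole-file-then-parse approach for a small file (the file name is only an example):

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream f("small.txt", std::ios::binary);  // example small file
    if (!f)
    {
        std::cerr << "could not open small.txt\n";
        return 1;
    }

    // Slurp the whole file into memory in one go ...
    std::ostringstream contents;
    contents << f.rdbuf();

    // ... then parse it line by line without touching the disk again.
    std::istringstream in(contents.str());
    std::string line;
    int lines = 0;
    while (std::getline(in, line))
        ++lines;

    std::cout << "Lines=" << lines << '\n';
}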

0

I believe the C++ idiom would be to read the file line by line and build a line-based container as you read it. Most likely the iostreams (getline) will be buffered well enough that you will not notice a significant difference.

However, for very large files you may get better performance by reading large chunks of the file (not the entire file at once) and splitting them on newlines as they are found.
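
A minimal sketch of that chunked approach (the chunk size and file name are arbitrary; the important part is carrying an incomplete line over to the next chunk):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::ifstream f("big.txt", std::ios::binary);    // example large file
    if (!f)
    {
        std::cerr << "could not open big.txt\n";
        return 1;
    }

    std::vector<char> chunk(1 << 20);                // read 1 MB at a time
    std::string carry;                               // incomplete line from the previous chunk
    long lines = 0;

    while (f.read(chunk.data(), chunk.size()) || f.gcount() > 0)
    {
        carry.append(chunk.data(), static_cast<std::size_t>(f.gcount()));

        // Split off every complete line; keep the unfinished tail for the next chunk.
        std::string::size_type start = 0, pos;
        while ((pos = carry.find('\n', start)) != std::string::npos)
        {
            // carry.substr(start, pos - start) is one line; here we only count it.
            ++lines;
            start = pos + 1;
        }
        carry.erase(0, start);
    }
    if (!carry.empty())
        ++lines;                                     // last line without a trailing newline

    std::cout << "Lines=" << lines << '\n';
}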

If you want to know exactly which method is faster, and by how much, you will need to profile your code.

0

It is best to read all the data at once if it fits in memory, because whenever you request I/O your program loses the processor and is put into a wait queue.


However, if the file is large, it is better to read only as much data as you need to process at a time, because a larger read operation takes longer to complete than small ones, and the processor's context-switching time is much shorter than the time needed to read the entire file.

0
