getline will end up calling read() as a system call somewhere deep in the guts of the C library. Exactly how many times it is called, and how it is called, depends on the design of the C library. But most likely there is no dramatic difference between reading a line at a time and reading the whole file, because at the lower level the OS will read (at least) one disk block at a time, and most likely at least a "page" (4 KB), if not more.
Also, unless you do almost nothing with each line after you read it (for example, you are writing something like "grep", so you mostly just read the data to find a line), it is unlikely that the overhead of reading line by line is where most of your time is spent.
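If you want to convince yourself that there is a user-space buffer sitting between getline and read(), you can experiment with its size; here is a minimal sketch (the file name is just a placeholder, the 1 MB size is an arbitrary choice, and whether pubsetbuf is honoured at all is implementation-defined):

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Give the ifstream a larger user-space buffer. The library still asks
    // the OS for data in big chunks either way; this only changes how often
    // it has to do so.
    std::ifstream f;
    static char buf[1 << 20];                 // 1 MB buffer (arbitrary choice)
    f.rdbuf()->pubsetbuf(buf, sizeof(buf));   // must be called before open()
    f.open("somefile.txt");                   // hypothetical file name

    std::string line;
    long lines = 0;
    while (std::getline(f, line))
        ++lines;
    std::cout << "Lines=" << lines << '\n';
}
```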
But reading the entire file in one go has a couple of clear problems:
- You do not start processing until you have read the entire file.
- You need enough memory to hold the entire file - what if the file is several hundred GB? Does your program simply fail then? (A streaming alternative is sketched below.)
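For contrast, a plain line-by-line loop starts working on the first line immediately and keeps memory use bounded by the longest line, not the file size. A minimal sketch (the file name and the per-line work are placeholders):

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream f("somefile.txt");    // hypothetical input file
    if (!f) {
        std::cerr << "could not open file\n";
        return 1;
    }

    std::string line;
    long lines = 0;
    while (std::getline(f, line)) {
        // Process each line as soon as it is available; memory use stays
        // proportional to the longest line, not to the size of the file.
        ++lines;
    }
    std::cout << "Lines=" << lines << '\n';
}
```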
Do not try to optimize something unless you have used profiling to prove that it is actually part of why your code is slow - you will just cause more problems for yourself.
Edit: So, I wrote a program to measure this, since I find it quite interesting.
And the results are certainly interesting. To make the comparison fair, I created three large files of 1297984192 bytes each (by copying all the source files in a directory with about a dozen different source files into one file, and then copying that file over itself several times to "multiply" it, until running the test took more than 1.5 seconds - which is roughly how long I think a test needs to run to make sure the timing is not too susceptible to a random "network packet" or other external influence taking time away from the process).
I also decided to measure the system and user time used by the process.
```
$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.98 (user:1.83 system: 0.14)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.68 system: 0.389)
Lines=24812608
Wallclock time for readwhole is 2.52 (user:1.79 system: 0.723)
$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.96 (user:1.83 system: 0.12)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.67 system: 0.392)
Lines=24812608
Wallclock time for readwhole is 2.48 (user:1.76 system: 0.707)
```
Here are the three different functions for reading the file (there is code for measuring the time and so on as well, of course, but to keep the size of this post down I would rather not post all of it - and I shuffled the order of the tests to see whether it made a difference, so the results above are not in the same order as the functions here):
```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <cstdlib>

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

using namespace std;

// Read the whole file into one buffer, then count lines from a stringstream.
void func_readwhole(const char *name)
{
    string fullname = string("bigfile_") + name;

    ifstream f(fullname.c_str());
    if (!f)
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    f.seekg(0, ios::end);
    streampos size = f.tellg();
    f.seekg(0, ios::beg);

    char* buffer = new char[size];
    f.read(buffer, size);
    if (f.gcount() != size)
    {
        cerr << "Read failed ...\n";
        exit(1);
    }

    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);

    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }

    f.close();
    cout << "Lines=" << lines << endl;

    delete [] buffer;
}

// Read the file line by line with getline directly on the ifstream.
void func_getline(const char *name)
{
    string fullname = string("bigfile_") + name;

    ifstream f(fullname.c_str());
    if (!f)
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    string str;
    int lines = 0;

    while(getline(f, str))
    {
        lines++;
    }

    cout << "Lines=" << lines << endl;

    f.close();
}

// mmap the whole file and count lines from a stringstream over the mapping.
void func_mmap(const char *name)
{
    char *buffer;

    string fullname = string("bigfile_") + name;

    int f = open(fullname.c_str(), O_RDONLY);

    off_t size = lseek(f, 0, SEEK_END);
    lseek(f, 0, SEEK_SET);

    buffer = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, f, 0);

    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);

    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }

    munmap(buffer, size);
    cout << "Lines=" << lines << endl;
}
```
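The timing harness itself isn't shown above. Purely as an illustration of how the wall-clock and user/system numbers could be collected around each of these functions, here is my own sketch using std::chrono and getrusage (time_one and tv_seconds are hypothetical helpers, not the original measurement code, and the printf format is just an approximation of the output above):

```cpp
#include <cstdio>
#include <chrono>
#include <sys/resource.h>
#include <sys/time.h>

// Convert a struct timeval (as returned by getrusage) to seconds.
static double tv_seconds(const struct timeval &tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

// Run one of the test functions and report wall-clock, user and system time.
static void time_one(const char *label, void (*func)(const char *), const char *name)
{
    struct rusage r0, r1;

    auto t0 = std::chrono::steady_clock::now();
    getrusage(RUSAGE_SELF, &r0);

    func(name);

    getrusage(RUSAGE_SELF, &r1);
    auto t1 = std::chrono::steady_clock::now();

    double wall = std::chrono::duration<double>(t1 - t0).count();
    double user = tv_seconds(r1.ru_utime) - tv_seconds(r0.ru_utime);
    double sys  = tv_seconds(r1.ru_stime) - tv_seconds(r0.ru_stime);

    std::printf("Wallclock time for %s is %.2f (user:%.2f system: %.3g)\n",
                label, wall, user, sys);
}
```

It would be called as, for example, `time_one("getline", func_getline, "2")`, where "2" stands in for whatever suffix the "bigfile_" test files actually use.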