Improving I/O performance for merging two files in C

I wrote a function that merges two large files (file1, file2) into a new file (outputFile). Each file contains string data, and records are separated by the byte \0. Both files have the same number of null bytes.

One example file with two entries might look like this: A\nB\n\0C\nZ\nB\n\0

Input:

    file1:      A\nB\0C\nZ\nB\n\0
    file2:      BBA\nAB\0T\nASDF\nQ\n\0

Output:

    outputFile: A\nB\nBBA\nAB\0C\nZ\nB\nT\nASDF\nQ\n\0

My current code:

    FILE *outputFile = fopen(...);
    setvbuf(outputFile, NULL, _IOFBF, 1024 * 1024 * 1024);
    FILE *file1 = fopen(...);
    FILE *file2 = fopen(...);

    int c1, c2;
    while ((c1 = fgetc(file1)) != EOF) {
        if (c1 == '\0') {
            while ((c2 = fgetc(file2)) != EOF && c2 != '\0') {
                fwrite(&c2, sizeof(char), 1, outputFile);
            }
            char nullByte = '\0';
            fwrite(&nullByte, sizeof(char), 1, outputFile);
        } else {
            fwrite(&c1, sizeof(char), 1, outputFile);
        }
    }

Is there a way to improve the I/O performance of this function? I already increased outputFile's buffer to 1 GB using setvbuf. Would it help to use posix_fadvise on file1 and file2?

+6
4 answers

You are doing I/O one character at a time. That will be uselessly, painfully SLOW, even with buffered streams.

Take advantage of the fact that your data is stored in your files as NUL-terminated strings.

Assuming you are alternating the NUL-terminated strings from each file, and that you are running on a POSIX platform so you can simply mmap() the input files:

    typedef struct mapdata {
        const char *ptr;
        size_t bytes;
    } mapdata_t;

    mapdata_t mapFile( const char *filename )
    {
        mapdata_t data;
        struct stat sb;
        int fd = open( filename, O_RDONLY );
        fstat( fd, &sb );
        data.bytes = sb.st_size;
        /* assumes we have a NUL byte after the file data;
           if the size of the file is an exact multiple of the
           page size, we won't have the terminating NUL byte! */
        data.ptr = mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
        close( fd );
        return( data );
    }

    void unmapFile( mapdata_t data )
    {
        munmap( ( void * ) data.ptr, data.bytes );
    }

    void mergeFiles( const char *file1, const char *file2, const char *output )
    {
        char zeroByte = '\0';
        mapdata_t data1 = mapFile( file1 );
        mapdata_t data2 = mapFile( file2 );
        size_t strOffset1 = 0UL;
        size_t strOffset2 = 0UL;

        /* get a page-aligned buffer - a 64 kB alignment should work */
        char *iobuffer = memalign( 64UL * 1024UL, 1024UL * 1024UL );

        /* memset the buffer to ensure the virtual mappings exist */
        memset( iobuffer, 0, 1024UL * 1024UL );

        /* use of direct IO should reduce memory pressure - the 1 MB
           buffer is already pretty large, and since we're not seeking,
           the page cache is really only slowing things down */
        int fd = open( output, O_RDWR | O_TRUNC | O_CREAT | O_DIRECT, 0644 );
        FILE *outputfile = fdopen( fd, "wb" );
        setvbuf( outputfile, iobuffer, _IOFBF, 1024UL * 1024UL );

        /* loop until we reach the end of either mapped file */
        for ( ;; )
        {
            fputs( data1.ptr + strOffset1, outputfile );
            fwrite( &zeroByte, 1, 1, outputfile );
            fputs( data2.ptr + strOffset2, outputfile );
            fwrite( &zeroByte, 1, 1, outputfile );

            /* skip over each string, assuming there is exactly one
               NUL byte between strings */
            strOffset1 += 1 + strlen( data1.ptr + strOffset1 );
            strOffset2 += 1 + strlen( data2.ptr + strOffset2 );

            /* if either offset is past the end of its file, end the loop */
            if ( ( strOffset1 >= data1.bytes ) ||
                 ( strOffset2 >= data2.bytes ) )
            {
                break;
            }
        }

        fclose( outputfile );
        unmapFile( data1 );
        unmapFile( data2 );
    }

I did not check for errors at all. You will also need to add the appropriate header files.
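As a sketch of what that error checking might look like for the mapping step (mapFileChecked is just an illustrative variant that prints a message and exits on failure):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct mapdata {
    const char *ptr;
    size_t bytes;
} mapdata_t;

/* illustrative error-checked variant of mapFile() */
mapdata_t mapFileChecked( const char *filename )
{
    mapdata_t data;
    struct stat sb;
    int fd = open( filename, O_RDONLY );
    if ( fd < 0 ) {
        perror( filename );
        exit( EXIT_FAILURE );
    }
    if ( fstat( fd, &sb ) < 0 ) {
        perror( "fstat" );
        exit( EXIT_FAILURE );
    }
    data.bytes = sb.st_size;
    data.ptr = mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    if ( data.ptr == MAP_FAILED ) {
        perror( "mmap" );
        exit( EXIT_FAILURE );
    }
    close( fd );
    return data;
}
```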

Please also note that this code assumes the file size is NOT an exact multiple of the system page size, which is what guarantees that zero bytes appear after the file contents in the mapping. If the file size is an exact multiple of the page size, you must mmap() an extra page after the file contents to ensure there is a terminating NUL byte for the last string.

Alternatively, you can rely on the last byte of the file contents being a NUL byte. If that ever turns out to be wrong, you will likely get either a SIGSEGV or corrupted data.

+1
  • You use two function calls per character (one for input, one for output). Function calls are slow (they pollute the instruction pipeline).
  • fgetc() and fputc() have getc()/putc() counterparts, which may be implemented as macros, allowing the compiler to inline the entire loop except for the buffer refills and flushes, which happen only once per 512, 1024, or 4096 characters processed. (Those trigger system calls, but they are unavoidable in any case.)
  • Using read/write instead of buffered I/O is probably not worth the effort; the extra bookkeeping will make your loop less tight. (By the way, using fwrite() to write a single character is certainly wasteful, and the same goes for write().)
  • Perhaps a larger output buffer might help, but I would not count on it.
0

A slight improvement: if you are going to write individual characters, use fputc, not fwrite.

Also, since you care about speed, try putc and getc rather than fputc and fgetc, and see whether that is faster.

-1

If you can use threads, create one for file1 and another for file2.

Make outputFile as large as it needs to be, then have thread1 write file1 into outputFile,

while thread2 seeks in outputFile to the length of file1 + 1 and writes file2 there.

Edit:

This is not the right answer for this case, but to prevent confusion I will leave it here.

More discussion I found about this: improve performance in IO file in C

-2
