I have a piece of code that analyzes data streams from very large (10-100 GB) binary files. It works well, so now it's time to start optimizing, and disk I/O is currently the biggest bottleneck.
Two types of files are used. The first type of file consists of a stream of 16-bit integers, which must be scaled after I/O to convert them to physically meaningful floating-point values. I read the file in chunks, reading one 16-bit code at a time, doing the required scaling, and then storing the result in an array. Code below:
    int64_t read_current_chimera(FILE *input, double *current, int64_t position, int64_t length, chimera *daqsetup)
    {
        int64_t test;
        uint16_t iv;
        int64_t i;
        int64_t read = 0;

        if (fseeko64(input, (off64_t)position * sizeof(uint16_t), SEEK_SET))
        {
            return 0;
        }

        for (i = 0; i < length; i++)
        {
            test = fread(&iv, sizeof(uint16_t), 1, input);
            if (test == 1)
            {
                read++;
                current[i] = chimera_gain(iv, daqsetup);
            }
            else
            {
                perror("End of file reached");
                break;
            }
        }
        return read;
    }
The chimera_gain function simply takes a 16-bit integer, scales it, and returns a double for storage.
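For context, the conversion is just a linear scaling; a minimal sketch of the kind of thing chimera_gain does is below (the struct fields here are simplified placeholders, not my real chimera definition):

    #include <stdint.h>

    /* Hypothetical, simplified stand-in for my actual chimera struct. */
    typedef struct
    {
        double scale;   /* assumed multiplicative gain factor */
        double offset;  /* assumed additive offset */
    } chimera;

    /* Converts one raw 16-bit code to a physically meaningful double. */
    static inline double chimera_gain(uint16_t raw, chimera *daqsetup)
    {
        return (double)raw * daqsetup->scale + daqsetup->offset;
    }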
The second type of file contains 64-bit doubles, but it has two columns, of which I only need the first. To do this, I fread the doubles in pairs and discard the second one. The doubles must also be byte-swapped before use. The code I use for this is below:
    int64_t read_current_double(FILE *input, double *current, int64_t position, int64_t length)
    {
        int64_t test;
        double iv[2];
        int64_t i;
        int64_t read = 0;

        if (fseeko64(input, (off64_t)position * 2 * sizeof(double), SEEK_SET))
        {
            return 0;
        }

        for (i = 0; i < length; i++)
        {
            test = fread(iv, sizeof(double), 2, input);
            if (test == 2)
            {
                read++;
                swapByteOrder((int64_t *)&iv[0]);
                current[i] = iv[0];
            }
            else
            {
                perror("End of file reached: ");
                break;
            }
        }
        return read;
    }
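The swapByteOrder call just reverses the byte order of the 8-byte value in place so the on-disk doubles match my machine's endianness; a minimal illustrative sketch is below (my actual implementation may differ in detail):

    #include <stdint.h>

    /* Illustrative in-place byte-order reversal of a 64-bit value,
       equivalent to a 64-bit byte swap (bswap64). */
    static void swapByteOrder(int64_t *value)
    {
        uint64_t v = (uint64_t)*value;
        v = ((v & 0x00000000FFFFFFFFULL) << 32) | ((v & 0xFFFFFFFF00000000ULL) >> 32);
        v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v & 0xFFFF0000FFFF0000ULL) >> 16);
        v = ((v & 0x00FF00FF00FF00FFULL) <<  8) | ((v & 0xFF00FF00FF00FF00ULL) >>  8);
        *value = (int64_t)v;
    }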
Can someone suggest a method of reading these types of files that will be significantly faster than what I'm doing now?