I have a piece of code that analyzes data streams from very large (10-100 GB) binary files. It works well, so now it's time to start optimizing, and disk I/O is currently the biggest bottleneck.
Two types of files are used. The first type of file consists of a stream of 16-bit integers, which must be scaled after I/O to convert them to physically meaningful floating-point values. I read the file in chunks, reading one 16-bit code at a time, doing the required scaling, and then storing the result in an array. Code below:
    int64_t read_current_chimera(FILE *input, double *current, int64_t position, int64_t length, chimera *daqsetup)
    {
        int64_t test;
        uint16_t iv;
        int64_t i;
        int64_t read = 0;

        if (fseeko64(input, (off64_t)position * sizeof(uint16_t), SEEK_SET))
        {
            return 0;
        }

        for (i = 0; i < length; i++)
        {
            test = fread(&iv, sizeof(uint16_t), 1, input);
            if (test == 1)
            {
                read++;
                current[i] = chimera_gain(iv, daqsetup);
            }
            else
            {
                perror("End of file reached");
                break;
            }
        }
        return read;
    }
The chimera_gain function simply takes a 16-bit integer, scales it, and returns a double for storage.
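For context, the conversion is just a linear scaling; a minimal sketch of the kind of thing chimera_gain does is below (the struct fields here are simplified placeholders, not my real chimera definition):

    #include <stdint.h>

    /* Hypothetical, simplified stand-in for my actual chimera struct. */
    typedef struct
    {
        double scale;   /* assumed multiplicative gain factor */
        double offset;  /* assumed additive offset */
    } chimera;

    /* Converts one raw 16-bit code to a physically meaningful double. */
    static inline double chimera_gain(uint16_t raw, chimera *daqsetup)
    {
        return (double)raw * daqsetup->scale + daqsetup->offset;
    }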
The second type of file contains 64-bit doubles, but it has two columns, of which I only need the first. To do this, I fread the doubles in pairs and discard the second one. The doubles must also be byte-swapped before use. The code I use for this is below:
    int64_t read_current_double(FILE *input, double *current, int64_t position, int64_t length)
    {
        int64_t test;
        double iv[2];
        int64_t i;
        int64_t read = 0;

        if (fseeko64(input, (off64_t)position * 2 * sizeof(double), SEEK_SET))
        {
            return 0;
        }

        for (i = 0; i < length; i++)
        {
            test = fread(iv, sizeof(double), 2, input);
            if (test == 2)
            {
                read++;
                swapByteOrder((int64_t *)&iv[0]);
                current[i] = iv[0];
            }
            else
            {
                perror("End of file reached: ");
                break;
            }
        }
        return read;
    }
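The swapByteOrder call just reverses the byte order of the 8-byte value in place so the on-disk doubles match my machine's endianness; a minimal illustrative sketch is below (my actual implementation may differ in detail):

    #include <stdint.h>

    /* Illustrative in-place byte-order reversal of a 64-bit value,
       equivalent to a 64-bit byte swap (bswap64). */
    static void swapByteOrder(int64_t *value)
    {
        uint64_t v = (uint64_t)*value;
        v = ((v & 0x00000000FFFFFFFFULL) << 32) | ((v & 0xFFFFFFFF00000000ULL) >> 32);
        v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v & 0xFFFF0000FFFF0000ULL) >> 16);
        v = ((v & 0x00FF00FF00FF00FFULL) <<  8) | ((v & 0xFF00FF00FF00FF00ULL) >>  8);
        *value = (int64_t)v;
    }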
Can someone suggest a method of reading these types of files that will be significantly faster than what I'm doing now?