To measure the throughput of your application, you need to know how much memory is read and/or written (call that the numerator) and how long the reads and/or writes take (call that the denominator). Bandwidth is then the numerator divided by the denominator.
If the application is complex, it may not be easy to work out how much memory is read and/or written. On top of that, if your application performs many other operations, you have to subtract their time from the measurement, which is also difficult. For these reasons, simple kernels are usually used when measuring maximum throughput.
If you want to choose a benchmark kernel to compare against your application, first determine whether your application only writes data, only reads data, or both reads and writes.
If you only write data, you can use a write (memset) test:
#pragma omp parallel for
for(int i=0; i<n; i++) { x[i] = k; }
If you both read and write data, you can use a simple copy (memcpy) test:
#pragma omp parallel for
for(int i=0; i<n; i++) { y[i] = x[i]; }
In fact, if you look at the STREAM source code, that is essentially what it does for the copy test.
If you only read data, you can use a reduction (don't forget to compile with -ffast-math if you want this to be vectorized):
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<n; i++) { sum += x[i]*y[i]; }
The STREAM tests are all read-and-write tests. I have written my own bandwidth tool that does write-only, read-and-write, and read-only tests.
Unfortunately, tests that write data will not get close to the maximum throughput. The reason is that in order to write data, the hardware must first read the cache line into the cache. This is why STREAM does not come close to the maximum throughput on my system. To get the maximum write throughput, you need to use non-temporal stores, which write data without first reading it into the cache.
For example, with SSE, and assuming x and y are float arrays, you can do the read-and-write test with non-temporal stores like this:
#pragma omp parallel for
for(int i=0; i<n/4; i++) {
    __m128 v = _mm_load_ps(&x[4*i]);
    _mm_stream_ps(&y[4*i], v);
}
If you look at Agner Fog's asmlib, you will see that this is exactly what it does in memset and memcpy for large arrays. In fact, both the asmlib and the example I just gave get 85% of the bandwidth on my system (45 GB/s out of 51 GB/s), while the STREAM tests get about 45%.
These tests assume that your algorithm is memory bandwidth bound and that, for comparison, you read an array much larger than the largest cache. If your algorithm reuses data that is still in the cache, then the read test will not get close to the maximum bandwidth either, because of the loop-carried dependency in the reduction. To fix that, you need to unroll the loop 3-10 times, depending on the operation and the hardware. Also, if you write to arrays that fit in the cache and that you will reuse, you do not want to use non-temporal stores. That is why Agner Fog's asmlib only uses non-temporal stores for large arrays.