Multi-threaded application bandwidth measurement

What is the easiest and most efficient way to measure the bandwidth that my application (multithreaded, written using OpenMP) uses? I ran STREAM to get the maximum sustainable bandwidth, and now I would like to know whether or not I am saturating all of the available bandwidth.

I found a couple of related questions (for example, Measuring the bandwidth of the main memory), but I could not find the answer to this particular question.

Unfortunately, I cannot use VTune, but I can use PAPI counters.

My main goal is to find out if the poor scalability of my application is due to memory bandwidth saturation.

thanks

+7
performance multithreading ram openmp bandwidth
2 answers

There are a number of ways to get the whole-application bandwidth (from the command line), but it sounds like there are individual kernels you would like to look at separately. In that case, wrapping parts of your code with PAPI calls is a perfectly reasonable approach.

You can use the PAPI event counters on your system (see papi_avail) to find the total number of load/store instructions, and if you know the size of your loads/stores, you can get the memory bandwidth from that. Alternatively, you can count the hits into your various caches and multiply by the line sizes to infer the actual amount of data transferred around the system. There is documentation in various places on the PAPI wiki, e.g. here for the high-level interface, and here for some useful formulas for derived quantities.
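As a minimal sketch of the second (miss-counting) variant — this is my illustration, not part of the example below — the traffic estimate is just misses times line size. It assumes a 64-byte cache line and that the PAPI_L2_DCM preset exists on your hardware (check papi_avail); measure_me() and elapsed_seconds stand in for your own kernel and timer:

 #include <papi.h>

 /* Estimate traffic from L2 data-cache misses x line size.
    Assumes a 64-byte line; PAPI_L2_DCM availability is hardware-dependent. */
 int       events[1] = {PAPI_L2_DCM};
 long_long misses[1];

 PAPI_start_counters(events, 1);
 measure_me();                        /* placeholder: the kernel you care about */
 PAPI_stop_counters(misses, 1);

 double bytes = (double)misses[0] * 64.0;                 /* misses x line size */
 double mbps  = bytes / (elapsed_seconds * 1024 * 1024);  /* elapsed_seconds: your timer */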

Below is an example of simple code that performs a matrix-vector multiplication both in a sensible way and in a cache-unfriendly, transposed way. Note that calling PAPI_read_counters resets the counters, which is what we want here.

 #include <stdio.h>
 #include <stdlib.h>
 typedef char * caddr_t;
 #include <papi.h>
 #include <sys/time.h>

 int init(float ***a, float **x, float **y, int size);
 void report_results(char *tname, long_long *values, const int n, double wtime);
 void sensible_matvec(float **a, float *x, float *y, int size);
 void wrong_order_matvec(float **a, float *x, float *y, int size);
 void tick(struct timeval *t);
 double tock(struct timeval *t);

 #define NUM_EVENTS 3
 int main(int argc, char **argv) {
     const int matsize = 4096;

     float **a, *x, *y;
     init(&a, &x, &y, matsize);

     int events[NUM_EVENTS] = {PAPI_L1_DCM, PAPI_LST_INS, PAPI_FP_INS};
     long_long values[NUM_EVENTS];
     double walltime;
     struct timeval t;

     if (PAPI_start_counters(events, NUM_EVENTS) != PAPI_OK) {
        fprintf(stderr, "Error starting PAPI counters; aborting\n");
        exit(1);
     }

     tick(&t);
     sensible_matvec(a, x, y, matsize);
     PAPI_read_counters(values, NUM_EVENTS);
     walltime = tock(&t);

     report_results("Sensible", values, NUM_EVENTS, walltime);

     tick(&t);
     wrong_order_matvec(a, x, y, matsize);
     PAPI_stop_counters(values, NUM_EVENTS);
     walltime = tock(&t);

     report_results("Wrong order", values, NUM_EVENTS, walltime);

     return 0;
 }

 void report_results(char *tname, long_long *values, const int n, double wtime) {
     long_long total_mem = values[1];
     long_long total_flops = values[2];
     long_long l1misses = values[0];
     printf("Test %s: time elapsed = %f, memory accesses = %lld, flop = %lld\n",
             tname, wtime, total_mem, total_flops);
     printf("\tMemory bandwidth (MB/sec) = %f\n", 1.0*total_mem*sizeof(float)/(wtime*1024*1024));
     printf("\tL1 cache miss rate = %f\n", 1.0*l1misses/total_mem);
     printf("\tMFLOPS = %lf\n\n", 1.0*total_flops/(wtime*1024*1024));
 }

 int alloc2d(float ***a, int n);
 int free2d(float ***a, int n);
 int alloc1d(float **x, int n);
 int free1d(float **x, int n);

 int init(float ***a, float **x, float **y, int size) {
     if (alloc2d(a,size))
         return -2;
     if (alloc1d(x,size)) {
         free2d(a,size);
         return -2;
     }
     if (alloc1d(y,size)) {
         free2d(a,size);
         free1d(x,size);
         return -3;
     }

     for (int i=0; i<size; i++) {
         (*x)[i] = (float)i;
         (*y)[i] = 0.;
     }

     for (int i=0; i<size; i++) {
         for (int j=0; j<size; j++) {
             (*a)[i][j] = i;
         }
     }

     return 0;
 }

 void sensible_matvec(float **a, float *x, float *y, int size) {
     for (int i=0; i<size; i++) {
         for (int j=0; j<size; j++) {
             y[i] += a[i][j]*x[j];
         }
     }
 }

 void wrong_order_matvec(float **a, float *x, float *y, int size) {
     for (int j=0; j<size; j++) {
         for (int i=0; i<size; i++) {
             y[i] += a[i][j]*x[j];
         }
     }
 }

 void tick(struct timeval *t) {
     gettimeofday(t, NULL);
 }

 double tock(struct timeval *t) {
     struct timeval now;
     gettimeofday(&now, NULL);
     return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
 }

 void freeall(float ***a, float **x, float **y, int size) {
     free2d(a, size);
     free1d(x, size);
     free1d(y, size);
     return;
 }

 int alloc2d(float ***a, int n) {
     float *data = (float *)malloc(n*n*sizeof(float));
     if (data == NULL) return -1;

     *a = (float **)malloc(n*sizeof(float *));
     if (*a == NULL) {free(data); return -1;};

     for (int i=0; i<n; i++)
         (*a)[i] = &(data[i*n]);

     return 0;
 }

 int free2d(float ***a, int n) {
     free (&((*a)[0][0]));
     free(*a);

     return 0;
 }

 int alloc1d(float **a, int n) {
     *a = (float *)malloc(n*sizeof(float));
     if (*a == NULL) return -1;

     return 0;
 }

 int free1d(float **a, int n) {
     free(*a);

     return 0;
 }

Running gives:

 $ gcc -o papi-test papi-test.c -I${PAPI_INC_DIR} -L${PAPI_LIB_DIR} -lpapi -Wall -std=c99
 $ ./papi-test
 Test Sensible: time elapsed = 0.121877, memory accesses = 302020775, flop = 33580481
         Memory bandwidth (MB/sec) = 9453.119330
         L1 cache miss rate = 0.003921
         MFLOPS = 262.763624

 Test Wrong order: time elapsed = 0.537639, memory accesses = 302026751, flop = 39629352
         Memory bandwidth (MB/sec) = 2142.963254
         L1 cache miss rate = 0.094045
         MFLOPS = 70.295301
+4

To measure the bandwidth of your application, you need to know how much memory is read and/or written — call that the numerator — and you need to know how long it takes to read and/or write it — call that the denominator. The bandwidth is the numerator divided by the denominator.

If your application is complicated, it may not be easy to calculate how much memory is read and/or written. Additionally, if it does many other operations besides reading and writing memory, the timing becomes difficult as well, because you would have to subtract the time spent on those other operations. That is why simple algorithms are usually used when measuring the maximum bandwidth.

If you want to pick a comparison algorithm that roughly matches your application, you should determine whether your application only writes data, only reads data, or both reads and writes.

If you only write data, you can use a write (memset-like) test:

 #pragma omp parallel for
 for(int i=0; i<n; i++) {
     x[i] = k;
 }

If you both read and write data, you can use a copy (memcpy-like) test:

 #pragma omp parallel for
 for(int i=0; i<n; i++) {
     y[i] = x[i];
 }
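To make the numerator/denominator idea above concrete, here is a minimal sketch (my own illustration, not from the answer) of timing the copy test with omp_get_wtime and computing a bandwidth figure. Counting the traffic as 2*n*sizeof(float) assumes one read of x and one write of y per element; with write-allocate caches the real traffic may be closer to 3x:

 #include <omp.h>

 /* Time the copy loop and return an estimated bandwidth in GB/s. */
 double copy_bandwidth(const float *x, float *y, int n) {
     double t0 = omp_get_wtime();
     #pragma omp parallel for
     for (int i = 0; i < n; i++) {
         y[i] = x[i];
     }
     double elapsed = omp_get_wtime() - t0;     /* denominator: seconds      */
     double bytes   = 2.0 * n * sizeof(float);  /* numerator: bytes moved    */
     return bytes / elapsed / 1e9;              /* GB/s                      */
 }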

In fact, if you look at the STREAM source code, that is essentially what it does for the copy test.

If you only read data, you can use a reduction (don't forget to compile with -ffast-math if you want this to be vectorized):

 #pragma omp parallel for reduction(+:sum)
 for(int i=0; i<n; i++) {
     sum += x[i]*y[i];
 }

The STREAM tests are all read-and-write tests. I have written my own bandwidth tool which does write-only, read-and-write, and read-only tests.

Unfortunately, tests which write data will not get close to the maximum bandwidth. The reason is that in order to write data, the cache line must first be read into the cache. This is why STREAM does not get anywhere near the maximum bandwidth on my system. To get the maximum bandwidth when writing, you need to use non-temporal stores, which write data without first reading it into the cache.

For example, when using SSE and assuming x and y are floating point arrays, you can run a read and write test as follows:

 #pragma omp parallel for
 for(int i=0; i<n/4; i++) {
     __m128 v = _mm_load_ps(&x[4*i]);
     _mm_stream_ps(&y[4*i], v);
 }

If you look at Agner Fog's asmlib, you will see that this is exactly what it does for memset and memcpy on large arrays. In fact, his asmlib and the example I just gave get 85% of the bandwidth on my system (45 GB/s out of 51 GB/s), whereas the STREAM tests get only about 45% of the bandwidth.

These tests assume that your algorithm is memory bound and that, for the comparison, you read an array much larger than the slowest cache. If your algorithm reuses data that is still in the cache, then the read tests will not get close to the maximum bandwidth either, because of the carried loop dependency. To fix that, you have to unroll the loop 3-10 times, depending on the operation and the hardware (see the sketch below). Also, if you are writing to arrays that fit in the cache and that you will reuse, you do not want to use non-temporal stores; that is why Agner Fog's asmlib only uses non-temporal stores for large arrays.
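As an illustration of that unrolling (my own sketch, not part of the answer), the carried dependency in the read-only reduction can be broken by keeping several independent partial sums; the unroll factor of 4 is just an example and needs tuning for your hardware:

 float sum0 = 0.f, sum1 = 0.f, sum2 = 0.f, sum3 = 0.f;
 /* Unroll by 4 with separate accumulators so each iteration does not have to
    wait on a single running sum; assumes n is a multiple of 4. */
 #pragma omp parallel for reduction(+:sum0,sum1,sum2,sum3)
 for (int i = 0; i < n; i += 4) {
     sum0 += x[i]   * y[i];
     sum1 += x[i+1] * y[i+1];
     sum2 += x[i+2] * y[i+2];
     sum3 += x[i+3] * y[i+3];
 }
 float sum = sum0 + sum1 + sum2 + sum3;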

+1
