SSE and AVX performance with limited memory bandwidth

In the code below, I changed "dataLen" and got different performance.

    dataLen =  400    SSE time:   758000 us    AVX time:   483000 us    SSE > AVX  (AVX faster)
    dataLen = 2400    SSE time:  4212000 us    AVX time:  2636000 us    SSE > AVX  (AVX faster)
    dataLen = 2864    SSE time:  6115000 us    AVX time:  6146000 us    SSE ~= AVX
    dataLen = 3200    SSE time:  8049000 us    AVX time:  9297000 us    SSE < AVX  (SSE faster)
    dataLen = 4000    SSE time: 10170000 us    AVX time: 11690000 us    SSE < AVX  (SSE faster)

The SSE and the AVX code both implement the same operation: buf3[i] += buf1[i] * buf2[i];
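In scalar form, that kernel is simply the following (a minimal equivalent sketch, using the same buffer names as the full program below):

    for (int j = 0; j < N; j++)
        for (int i = 0; i < dataLen; i++)
            buf3[i] += buf1[i] * buf2[i];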

#include "testfun.h" #include <iostream> #include <chrono> #include <malloc.h> #include "immintrin.h" using namespace std::chrono; void testfun() { int dataLen = 4000; int N = 10000000; float *buf1 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32)); float *buf2 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32)); float *buf3 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32)); for(int i=0; i<dataLen; i++) { buf1[i] = 1; buf2[i] = 1; buf3[i] = 0; } //=========================SSE CODE===================================== system_clock::time_point SSEStart = system_clock::now(); __m128 p1, p2, p3; for(int j=0; j<N; j++) for(int i=0; i<dataLen; i=i+4) { p1 = _mm_load_ps(&buf1[i]); p2 = _mm_load_ps(&buf2[i]); p3 = _mm_load_ps(&buf3[i]); p3 = _mm_add_ps(_mm_mul_ps(p1, p2), p3); _mm_store_ps(&buf3[i], p3); } microseconds SSEtimeUsed = duration_cast<milliseconds>(system_clock::now() - SSEStart); std::cout << "SSE time used: " << SSEtimeUsed.count() << " us, " <<std::endl; //=========================AVX CODE===================================== for(int i=0; i<dataLen; i++) buf3[i] = 0; system_clock::time_point AVXstart = system_clock::now(); __m256 pp1, pp2, pp3; for(int j=0; j<N; j++) for(int i=0; i<dataLen; i=i+8) { pp1 = _mm256_load_ps(&buf1[i]); pp2 = _mm256_load_ps(&buf2[i]); pp3 = _mm256_load_ps(&buf3[i]); pp3 = _mm256_add_ps(_mm256_mul_ps(pp1, pp2), pp3); _mm256_store_ps(&buf3[i], pp3); } microseconds AVXtimeUsed = duration_cast<milliseconds>(system_clock::now() - AVXstart); std::cout << "AVX time used: " << AVXtimeUsed.count() << " us, " <<std::endl; _aligned_free(buf1); _aligned_free(buf2); } 

My CPU is an Intel Xeon E3-1225 v2, which has 4 cores with 32 KB of L1 data cache each. This code runs on a single core, so only 32 KB of L1 is available to it.

buf1, buf2 and buf3 are small enough to stay in the L1 and L2 caches (1 MB of L2 in total, 256 KB per core). Both the SSE and the AVX version should be limited by load/store bandwidth, so why does AVX take longer than SSE as dataLen grows?
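For reference, here is the working set for each tested dataLen (my own back-of-envelope arithmetic: three float buffers of dataLen elements, 4 bytes each):

    working set = 3 * dataLen * 4 B

    dataLen =  400  ->  ~4.7 KB
    dataLen = 2400  -> ~28.1 KB
    dataLen = 2864  -> ~33.6 KB
    dataLen = 3200  -> ~37.5 KB
    dataLen = 4000  -> ~46.9 KB

    (the 32 KB L1d boundary falls at dataLen ~ 32768 / 12 ~ 2730)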

Tags: performance, caching, sse, avx
2 answers

This is an interesting observation. I was able to reproduce your results. I managed to improve the speed of the SSE code a bit by unrolling the loop (see the code below). Now the SSE code at dataLen=2864 is clearly faster, for the smaller values it is nearly as fast as the AVX code, and for the larger values it is still faster. This is due to the loop-carried dependency in your SSE code (i.e. unrolling the loop increases the instruction-level parallelism, ILP). I did not try unrolling any further. Unrolling the AVX code did not help.

I do not have a definitive answer to your question. I suspect it is related to ILP and the fact that AVX processors such as Sandy Bridge can only load two 128-bit words (the SSE width) simultaneously, not two 256-bit words. So in the SSE code the CPU can do one SSE addition, one SSE multiplication, two SSE loads, and one SSE store per clock. For AVX it can do one AVX load (via two 128-bit loads on ports 2 and 3), one AVX multiplication, one AVX addition, and one 128-bit store (half the AVX width) per clock. In other words, although with AVX the multiplications and additions do twice as much work as with SSE, the loads and stores are still only 128 bits wide. Perhaps this sometimes leads to lower ILP with AVX compared to SSE in code that is dominated by loads and stores?
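To put rough numbers on that (my own back-of-envelope estimate, assuming Sandy/Ivy Bridge can sustain two 16-byte loads plus one 16-byte store per clock):

    AVX iteration (8 floats): 3 loads * 32 B = 96 B loaded, 32 B stored
        loads : 96 B / 32 B per clock = 3 clocks
        store : 32 B / 16 B per clock = 2 clocks   -> load-bound, ~3 clocks per 8 floats

    SSE iteration (4 floats): 3 loads * 16 B = 48 B loaded, 16 B stored
        loads : 48 B / 32 B per clock = 1.5 clocks
        store : 16 B / 16 B per clock = 1 clock    -> ~1.5 clocks per 4 floats = 3 clocks per 8 floats

By this count the pure load/store throughput limit is the same for both versions, so any gap between them would have to come from latency and ILP effects rather than from raw bandwidth.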

For more information on the ports and ILP, see this comparison of Haswell, Sandy Bridge and Nehalem.

    __m128 p1, p2, p3, p1_v2, p2_v2, p3_v2;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < dataLen; i += 8)
        {
            p1    = _mm_load_ps(&buf1[i]);
            p1_v2 = _mm_load_ps(&buf1[i + 4]);
            p2    = _mm_load_ps(&buf2[i]);
            p2_v2 = _mm_load_ps(&buf2[i + 4]);
            p3    = _mm_load_ps(&buf3[i]);
            p3_v2 = _mm_load_ps(&buf3[i + 4]);
            p3    = _mm_add_ps(_mm_mul_ps(p1, p2), p3);
            p3_v2 = _mm_add_ps(_mm_mul_ps(p1_v2, p2_v2), p3_v2);
            _mm_store_ps(&buf3[i], p3);
            _mm_store_ps(&buf3[i + 4], p3_v2);
        }
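For completeness, an unrolled AVX inner loop analogous to the SSE version above might look like the sketch below (my own reconstruction of what was tried; as noted, it did not help, and it assumes dataLen is a multiple of 16):

    // Two independent 256-bit dependency chains per iteration.
    __m256 q1, q2, q3, q1_v2, q2_v2, q3_v2;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < dataLen; i += 16)
        {
            q1    = _mm256_load_ps(&buf1[i]);
            q1_v2 = _mm256_load_ps(&buf1[i + 8]);
            q2    = _mm256_load_ps(&buf2[i]);
            q2_v2 = _mm256_load_ps(&buf2[i + 8]);
            q3    = _mm256_load_ps(&buf3[i]);
            q3_v2 = _mm256_load_ps(&buf3[i + 8]);
            q3    = _mm256_add_ps(_mm256_mul_ps(q1, q2), q3);
            q3_v2 = _mm256_add_ps(_mm256_mul_ps(q1_v2, q2_v2), q3_v2);
            _mm256_store_ps(&buf3[i], q3);
            _mm256_store_ps(&buf3[i + 8], q3_v2);
        }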

I think this is a weakness of the cache system in the Sandy Bridge architecture. I can reproduce the same result on an Ivy Bridge processor, but not on Haswell; Haswell shows the same problem only when the data has to come from L3. I think this is a significant flaw of AVX, and Intel should fix it in a later stepping or in the next architecture.

    N = 1000000, dataLen = 2000
    SSE time used:  280000 us,  AVX time used:  156000 us

    N = 1000000, dataLen = 4000          <- AVX still fast on Haswell, data now in L2
    SSE time used:  811000 us,  AVX time used:  702000 us

    N = 1000000, dataLen = 6000
    SSE time used: 1216000 us,  AVX time used: 1076000 us

    N = 1000000, dataLen = 8000
    SSE time used: 1622000 us,  AVX time used: 1466000 us

    N = 100000                           <- N reduced
    dataLen = 20000                      <- fits in L2: 256K / 12 = 21845.3
    SSE time used:  405000 us,  AVX time used:  374000 us

    N = 100000, dataLen = 40000          <- needs L3
    SSE time used: 1185000 us,  AVX time used: 1263000 us

    N = 100000, dataLen = 80000
    SSE time used: 2340000 us,  AVX time used: 2527000 us
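For reference, the working sets of the larger runs (my own arithmetic, assuming 256 KB of L2 per core and a shared L3):

    working set = 3 * dataLen * 4 B

    dataLen = 20000  ->  ~234 KB  (fits in the 256 KB L2)
    dataLen = 40000  ->  ~469 KB  (spills into L3)
    dataLen = 80000  ->  ~938 KB  (L3)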
