I'm trying to figure out whether my inner code loop is running into a hardware design barrier or into a lack of understanding on my part. There's a bit more to it, but the simplest form of the question is this:
If I have the following code:
float px[32768], py[32768], pz[32768];
float xref, yref, zref, deltx, delty, deltz;
int i, j;

initialize_with_random(px);
initialize_with_random(py);
initialize_with_random(pz);

for (i = 0; i < 32768 - 1; i++) {
    xref = px[i];
    yref = py[i];
    zref = pz[i];
    for (j = 0; j < 32768 - 1; j++) {
        deltx = xref - px[j];
        delty = yref - py[j];
        deltz = zref - pz[j];
    }
}
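For context, here is a minimal sketch (not my actual code, ignoring alignment and the loop remainder) of the kind of SSE intrinsics version of the inner loop I have in mind:

#include <xmmintrin.h>  /* SSE intrinsics: _mm_set1_ps, _mm_loadu_ps, _mm_sub_ps */

#define N 32768

/* Sketch of the inner loop only: one reference point against all others. */
void inner_loop_sse(const float *px, const float *py, const float *pz,
                    float xref, float yref, float zref)
{
    __m128 vxref = _mm_set1_ps(xref);   /* broadcast reference coordinates */
    __m128 vyref = _mm_set1_ps(yref);
    __m128 vzref = _mm_set1_ps(zref);
    int j;

    for (j = 0; j + 4 <= N - 1; j += 4) {   /* four floats per iteration */
        __m128 dx = _mm_sub_ps(vxref, _mm_loadu_ps(&px[j]));
        __m128 dy = _mm_sub_ps(vyref, _mm_loadu_ps(&py[j]));
        __m128 dz = _mm_sub_ps(vzref, _mm_loadu_ps(&pz[j]));
        (void)dx; (void)dy; (void)dz;   /* results unused here, as in the scalar loop */
    }
    /* a scalar tail loop would handle the remaining j values */
}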
What kind of maximum theoretical speedup could I expect from going to SSE instructions, in a situation where I have full control over the code (assembly, built-in functions, whatever) but no control over the runtime other than the architecture (i.e. a multi-user environment, so I can't do anything about how the OS kernel assigns time to my specific process)?
Right now I see a 3x acceleration with my code, when I would have thought that SSE would buy me more vector depth than a 3x speedup indicates (as I understand it, the vector width gives a 4x theoretical maximum for single-precision floats). I tried things like making deltx / delty / deltz arrays in case the compiler wasn't smart enough to advance them automatically, but I still only see a 3x speedup. I use the Intel C compiler with the appropriate compiler flags for vectorization, but no intrinsics.
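To be concrete, the array variant mentioned above looked roughly like this (a reconstructed sketch, not the exact code):

float deltx[32768], delty[32768], deltz[32768];  /* one slot per j */

for (j = 0; j < 32768 - 1; j++) {
    deltx[j] = xref - px[j];   /* each iteration writes its own element,  */
    delty[j] = yref - py[j];   /* which I hoped would make it easier for  */
    deltz[j] = zref - pz[j];   /* the compiler to vectorize the stores    */
}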
Justin hooper