Is it SIMD worth it? Is there a better option?

I have some code that works pretty well, but I would like it to work better. The main problem I am facing is that it must have a nested loop. The external one is for iterations (which should happen sequentially), and the internal one is for each point particle under consideration. I know that I can do little with the outside, but I wonder if there is a way to optimize something like:

void collide(particle particles[], box boxes[], double boxShiftX, double boxShiftY) {/*{{{*/ int i; double nX; double nY; int boxnum; for(i=0;i<PART_COUNT;i++) { boxnum = ((((int)(particles[i].sX+boxShiftX))/BOX_SIZE)%BWIDTH+ BWIDTH*((((int)(particles[i].sY+boxShiftY))/BOX_SIZE)%BHEIGHT)); //copied and pasted the macro which is why it kinda odd looking particles[i].vX -= boxes[boxnum].mX; particles[i].vY -= boxes[boxnum].mY; if(boxes[boxnum].rotDir == 1) { nX = particles[i].vX*Wxx+particles[i].vY*Wxy; nY = particles[i].vX*Wyx+particles[i].vY*Wyy; } else { //to make it randomly pick a rot. direction nX = particles[i].vX*Wxx-particles[i].vY*Wxy; nY = -particles[i].vX*Wyx+particles[i].vY*Wyy; } particles[i].vX = nX + boxes[boxnum].mX; particles[i].vY = nY + boxes[boxnum].mY; } }/*}}}*/ 

I looked at SIMD, although I can’t learn much about it, and I’m not quite sure that the processing needed to correctly extract and pack the data will be worth half the number of instructions, since it seems that they can be used simultaneously only two doubles.

I tried splitting it into multiple threads using shm and pthread_barrier (to synchronize the different steps, one of which is the code above), but it just made it slower.

My current code is pretty fast; it is on the order of one second for every 10M particles * iterations, and from what I can say from gprof, 30% of my time is spent only on this function (5000 calls, PART_COUNT = 8192 particles take 1.8 seconds). I'm not worried about small, constant things, it's just that 512K particles * 50K iterations * 1000 experiments took more than a week for the last time.

I think my question is is there any way to deal with these long vectors that are more efficient than just sorting through them. I feel it should be, but I can’t find it.

+6
optimization c simd
source share
5 answers

I'm not sure how much SIMD will win; the inner loop is quite small and simple, so I would suggest (just by looking) that you are probably more connected with memory than anything else. With this in mind, I would try to rewrite the main part of the loop so as not to touch the array of particles more than necessary:

 const double temp_vX = particles[i].vX - boxes[boxnum].mX; const double temp_vY = particles[i].vY - boxes[boxnum].mY; if(boxes[boxnum].rotDir == 1) { nX = temp_vX*Wxx+temp_vY*Wxy; nY = temp_vX*Wyx+temp_vY*Wyy; } else { //to make it randomly pick a rot. direction nX = temp_vX*Wxx-temp_vY*Wxy; nY = -temp_vX*Wyx+temp_vY*Wyy; } particles[i].vX = nX; particles[i].vY = nY; 

This has a small potential side effect, namely that it does not make an additional supplement at the end.


Another potential speedup would be to use __restrict in an array of particles so that the compiler can better optimize records at speeds. Also, if Wxx, etc. They are global variables; they may need to be reloaded every time through a loop instead of being possible to be stored in the register; using __restrict will help too.


Since you access the particles in order, you can try to pre-program (for example, __builtin_prefetch on GCC) a few particles ahead to reduce cache misses. Box prefetching is a bit more complicated as you access them in an unpredictable manner; you can try something like

 int nextBoxnum = ((((int)(particles[i+1].sX+boxShiftX) /// etc... // prefetch boxes[nextBoxnum] 

The last thing I just noticed is that if box :: rotDir is always +/- 1.0, then you can exclude the comparison and the branch in the inner loop as follows:

 const double rot = boxes[boxnum].rotDir; // always +/- 1.0 nX = particles[i].vX*Wxx + rot*particles[i].vY*Wxy; nY = rot*particles[i].vX*Wyx + particles[i].vY*Wyy; 

Naturally, the usual profiling warnings apply before and after. But I think all of this can help and can be done regardless of whether you switch to SIMD.

+6
source share

For record only, there is also libSIMDx86!

http://simdx86.sourceforge.net/Modules.html

(When compiling, you can also try: gcc -O3 -msse2 or the like).

+3
source share
 ((int)(particles[i].sX+boxShiftX))/BOX_SIZE 

It is expensive if sX is int (can't say). Truncate boxShiftX / Y to int before entering the loop.

+2
source share

Do you have enough profiling to tell you where time is spent on this feature?

For example, are you sure these are not your divs and mods in calculating boxnum, where is the time spent? Sometimes compilers cannot determine possible alternatives to shift / AND, even if a person (or at least the one who knew BOX_SIZE and BWIDTH / BHEIGHT that I don't know) could have the opportunity.

It would be a pity to spend a lot of time on SIMDifying the wrong bit of code ...

Another thing worth paying attention to is that the work can be forced into something that can work with a library, such as IPP, that will make informed decisions about how to best use the processor.

+1
source share

There are too many memory instructions, integers, and branches in your algorithm to have enough independent flops to profit from SIMD. The pipeline will constantly stop.

Finding a more efficient way of randomization will be the top of the list. Then try working either in float or int, but not both. Reinstall conditional expressions as arithmetic operations, or at least as a selection operation. Only then will SIMD become a realistic offer

+1
source share

All Articles