Improving locality and reducing cache pollution in medical image reconstruction

I am doing research at my university on an image reconstruction algorithm for medical use.

I have been stuck on one thing for three weeks now: I need to improve the performance of the following code:

for (lor=lor0[mypid]; lor <= lor1[mypid]; lor++)
{
  LOR_X = P.symmLOR[lor].x;
  LOR_Y = P.symmLOR[lor].y;
  LOR_XY = P.symmLOR[lor].xy;
  lor_z = P.symmLOR[lor].z;
  LOR_Z_X = P.symmLOR[lor_z].x;
  LOR_Z_Y = P.symmLOR[lor_z].y;
  LOR_Z_XY = P.symmLOR[lor_z].xy;  

  s0 = P.a2r[lor];
  s1 = P.a2r[lor+1];

  for (s=s0; s < s1; s++)
  {
    pixel     = P.a2b[s];
    v         = P.a2p[s]; 

    b[lor]    += v * x[pixel];

    p          = P.symm_Xpixel[pixel];
    b[LOR_X]  += v * x[p];

    p          = P.symm_Ypixel[pixel];
    b[LOR_Y]  += v * x[p];

    p          = P.symm_XYpixel[pixel];
    b[LOR_XY] += v * x[p];

    // do Z symmetry.
    pixel_z    = P.symm_Zpixel[pixel];
    b[lor_z]  += v * x[pixel_z];

    p            = P.symm_Xpixel[pixel_z];
    b[LOR_Z_X]  += v * x[p];

    p            = P.symm_Ypixel[pixel_z];
    b[LOR_Z_Y]  += v * x[p];

    p            = P.symm_XYpixel[pixel_z];
    b[LOR_Z_XY] += v * x[p];
  }

}

For those who want to know, the code implements the MLEM forward function, and all variables are float.

After several tests, I found that most of the run time is spent in this part of the code (you know, the 90-10 rule).

Later, I used PAPI (http://cl.cs.utk.edu/papi/) to measure L1D cache misses. As I suspected, PAPI confirms that performance drops because of a high number of misses, particularly from the random access to the b vector (which is huge).

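From what I have read so far, I know of two ways to improve performance: improve data locality or reduce cache pollution.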

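For the first, I will try to make the code cache-aware, as Ulrich Drepper proposes in What Every Programmer Should Know About Memory (www.akkadia.org/drepper/cpumemory.pdf), section A.1 (matrix multiplication).

I believe that blocking the SpMV (sparse matrix-vector multiplication) will improve performance.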

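On the other hand, every time the program accesses the b vector, we get what is known as cache pollution.

Is there a way to load a value from the b vector with a SIMD instruction without going through the cache?

Also, is it possible to use a function like void _mm_stream_ps (float * p, __m128 a) to store ONE float value in the b vector without polluting the cache?

I can't use _mm_stream_ps as it is, because it always stores 4 floats, while the access to the b vector is clearly random.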

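More info: v is the value of an entry of a sparse matrix stored in CRS format. I realize that other optimizations would be possible if I switched from CRS to another format; however, as I said, I have been running tests for months, and I know the slowdown comes from the random access to the b vector. From 400.000.000 L1D misses I go down to about 100 misses when I don't store to b.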
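For readers less familiar with CRS, here is a minimal sketch of the plain forward projection b = A·x without any of the symmetries. It reuses the question's array names, but the roles assigned to them (a2r as row pointers, a2b as pixel/column indices, a2p as values) are inferred from how the loop uses them, and the function name crs_forward is hypothetical.

```c
#include <stddef.h>

/* Assumed CRS layout, inferred from the loop in the question:
 *   a2r[lor] .. a2r[lor+1]-1 : range of nonzeros belonging to row `lor`
 *   a2b[s]                   : pixel (column) index of nonzero s
 *   a2p[s]                   : value of nonzero s
 * crs_forward is a hypothetical name, not part of the original code. */
static void crs_forward(size_t n_lors,
                        const size_t *a2r, const size_t *a2b,
                        const float *a2p, const float *x, float *b)
{
    for (size_t lor = 0; lor < n_lors; lor++) {
        float sum = 0.0f;                 /* partial result for this row */
        for (size_t s = a2r[lor]; s < a2r[lor + 1]; s++)
            sum += a2p[s] * x[a2b[s]];    /* only x is read indirectly */
        b[lor] += sum;                    /* b is written once per row */
    }
}
```

In this plain form b is written sequentially and x is the indirectly accessed vector; it is the symmetry targets (LOR_X, LOR_Y, ...) in the original loop that scatter the writes across b.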

+5
4

I'd say, first try to help your compiler a bit.

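  • Declare the parameters of the outer loop as const before the loop.
  • Declare every variable you can (all the LOR_.. ones) as local variables at the point of use, something like: float LOR_X = P.symmLOR[lor].x; or size_t s0 = P.a2r[lor];
  • The same goes for loop variables in particular, if you have a modern, C99-compliant compiler: for (size_t s=s0; s < s1; s++)
  • Separate the loads and stores for your b vector. The locations you access there do not depend on s. So load the current values into local variables before the inner loop, update those locals inside the loop, and store the results after it.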
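  • Perhaps split the inner loop into several loops, so each one touches fewer arrays. The index vectors are comparatively small, so they have a better chance of staying in cache.
  • If that is still not enough, consider changing the data layout. Look for a "struct of arrays" versus "array of structs" comparison.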
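One possible reading of the data-layout point, sketched under assumptions: the inner loop walks P.a2b and P.a2p in lockstep, so storing each nonzero's pixel index and weight side by side turns two sequential streams into one. The type nonzero_t, the helper names, and the 32-bit index width are all hypothetical; the original field types are not shown in the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interleaved nonzero: index and weight sit next to each
 * other, so both arrive with the same cache line. */
typedef struct {
    uint32_t pixel;   /* takes the place of P.a2b[s] */
    float    v;       /* takes the place of P.a2p[s] */
} nonzero_t;

/* Build the interleaved array once, outside the reconstruction loop. */
static void interleave(size_t nnz, const uint32_t *a2b, const float *a2p,
                       nonzero_t *out)
{
    for (size_t s = 0; s < nnz; s++) {
        out[s].pixel = a2b[s];
        out[s].v     = a2p[s];
    }
}

/* Dot product of one row's nonzeros with x, reading a single stream. */
static float row_dot(const nonzero_t *nz, size_t s0, size_t s1,
                     const float *x)
{
    float sum = 0.0f;
    for (size_t s = s0; s < s1; s++)
        sum += nz[s].v * x[nz[s].pixel];
    return sum;
}
```

Whether this helps depends on the hardware prefetcher and on how the rest of the code uses the two arrays, so it is worth measuring before committing to the change.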

+2

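A simple way to reduce the random access to the b vector is to never write to b inside the inner for loop.

Instead, load all the values you need from b into temporary variables, run the whole inner for loop updating only those temporaries, and then write them back to b.

At worst the temporaries will share a few cache lines, and depending on your compiler and environment, you may be able to hint that they should be kept in registers.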
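A minimal sketch of the temporaries idea, assuming the eight b indices of a LOR stay fixed during its inner loop (as they do in the question's code). Only the identity and the X symmetry are shown; the other six accumulators would follow the same pattern. The function name and the flat parameter list are stand-ins for the question's P.* fields, and the index types are assumptions.

```c
#include <stddef.h>

/* Process one LOR: accumulate in locals and touch b only after the
 * inner loop. forward_one_lor is a hypothetical name; the parameters
 * stand in for P.a2r, P.a2b, P.a2p and P.symm_Xpixel. */
static void forward_one_lor(size_t lor, size_t lor_x,
                            const size_t *a2r, const size_t *a2b,
                            const float *a2p, const size_t *symm_Xpixel,
                            const float *x, float *b)
{
    const size_t s0 = a2r[lor];
    const size_t s1 = a2r[lor + 1];

    float acc   = 0.0f;   /* running sum destined for b[lor]   */
    float acc_x = 0.0f;   /* running sum destined for b[lor_x] */

    for (size_t s = s0; s < s1; s++) {
        const size_t pixel = a2b[s];
        const float  v     = a2p[s];
        acc   += v * x[pixel];
        acc_x += v * x[symm_Xpixel[pixel]];
    }

    /* One read-modify-write per target instead of one per nonzero.
     * Using += keeps this correct even if lor == lor_x. */
    b[lor]   += acc;
    b[lor_x] += acc_x;
}
```

Because the partial sums are added to b in a different order than in the original loop, results may differ in the last few bits of the float values.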

+2

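I won't pretend to know what the code is doing :) But one possible cause of the extra memory traffic is aliasing: if the compiler cannot prove that b, x and the various P.symm arrays do not overlap, then writes to b constrain how reads from x and the P.symm arrays can be scheduled. If the compiler is pessimistic enough, it may even reload the fields of P from memory. All of this adds to the cache misses you are seeing. Two simple ways to improve this:

  • Use __restrict on b. This guarantees that b does not overlap the other arrays, so writes to it won't affect reads (or writes) of the other arrays.

  • Reorder things so that all the reads from the P.symm arrays come first, then the reads from x, and finally all the writes to b. This should break up some of the dependencies: the compiler can schedule the reads from P.symm in parallel, then the reads from x in parallel, and hopefully do the writes to b sensibly.

One more stylistic point (which helps with point No. 2) is to not reuse the variable p so much. There is no reason you can't have p_x, p_y, p_xy and so on, and it makes reordering the code easier.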

Once all this is in place, you can start adding prefetches (e.g. __builtin_prefetch on gcc) ahead of the known cache misses.
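A rough sketch of how these points could fit together for two of the symmetries, assuming a C99 compiler and that b really does not overlap any other array. The function name, the flat parameter list, the index types and the prefetch distance of 8 are all hypothetical; __builtin_prefetch is the GCC/Clang builtin and is compiled out elsewhere.

```c
#include <stddef.h>

#if defined(__GNUC__)
#  define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 1)
#else
#  define PREFETCH_READ(addr) ((void)0)   /* no-op on other compilers */
#endif

#define PREFETCH_DISTANCE 8   /* assumed value; needs tuning by measurement */

/* forward_xy is a hypothetical name. `restrict` on b promises the
 * compiler that stores to b cannot change anything read through the
 * other pointers. */
static void forward_xy(size_t s0, size_t s1,
                       const size_t *restrict a2b,
                       const float  *restrict a2p,
                       const size_t *restrict symm_X,
                       const size_t *restrict symm_Y,
                       const float  *restrict x,
                       float *restrict b, size_t lor_x, size_t lor_y)
{
    for (size_t s = s0; s < s1; s++) {
        /* Hint the x entry that a later iteration will need. */
        if (s + PREFETCH_DISTANCE < s1)
            PREFETCH_READ(&x[symm_X[a2b[s + PREFETCH_DISTANCE]]]);

        /* 1. all index loads, each into its own variable */
        const size_t pixel = a2b[s];
        const size_t p_x   = symm_X[pixel];
        const size_t p_y   = symm_Y[pixel];

        /* 2. all loads of values */
        const float v   = a2p[s];
        const float x_x = x[p_x];
        const float x_y = x[p_y];

        /* 3. all writes to b last */
        b[lor_x] += v * x_x;
        b[lor_y] += v * x_y;
    }
}
```

Note that computing the prefetch address itself reads a2b and symm_X ahead of time, so whether the hint pays off has to be checked with the same PAPI counters used in the question.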

Hope this helps.

+2

These are good answers, and I would ask: why is there so much indexing, by index values that do not change locally?

Also, it wouldn't kill you to take a few random pauses to see where the program usually is.

0