I am doing research at my university on an image reconstruction algorithm for medical use.
I have been stuck on this for three weeks now: I need to improve the performance of the following code:
// Loop over this process's range of LORs.
for (lor = lor0[mypid]; lor <= lor1[mypid]; lor++)
{
    // Indices of the LORs symmetric to 'lor' (X, Y, XY and Z mirrors).
    LOR_X  = P.symmLOR[lor].x;
    LOR_Y  = P.symmLOR[lor].y;
    LOR_XY = P.symmLOR[lor].xy;
    lor_z  = P.symmLOR[lor].z;
    LOR_Z_X  = P.symmLOR[lor_z].x;
    LOR_Z_Y  = P.symmLOR[lor_z].y;
    LOR_Z_XY = P.symmLOR[lor_z].xy;

    // Nonzeros of this LOR's row of the system matrix (CRS layout).
    s0 = P.a2r[lor];
    s1 = P.a2r[lor + 1];
    for (s = s0; s < s1; s++)
    {
        pixel = P.a2b[s];   // column (pixel) index
        v     = P.a2p[s];   // matrix value

        b[lor] += v * x[pixel];
        p = P.symm_Xpixel[pixel];
        b[LOR_X] += v * x[p];
        p = P.symm_Ypixel[pixel];
        b[LOR_Y] += v * x[p];
        p = P.symm_XYpixel[pixel];
        b[LOR_XY] += v * x[p];

        // do Z symmetry.
        pixel_z = P.symm_Zpixel[pixel];
        b[lor_z] += v * x[pixel_z];
        p = P.symm_Xpixel[pixel_z];
        b[LOR_Z_X] += v * x[p];
        p = P.symm_Ypixel[pixel_z];
        b[LOR_Z_Y] += v * x[p];
        p = P.symm_XYpixel[pixel_z];
        b[LOR_Z_XY] += v * x[p];
    }
}
For those who want to know, the code implements the MLEM forward function, and all the variables are float.
After several tests I noticed that the big delay is in this part of the code (you know, the 90-10 rule).
Later I used Papi (http://cl.cs.utk.edu/papi/) to measure L1D cache misses. As I suspected, Papi confirms that the performance drops because of the large number of misses, in particular from the random accesses to the b vector (which is huge).
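Roughly, this is how I wrapped the loop for the measurement (a simplified sketch; it assumes Papi's classic high-level counter API, and forward_project() is just a placeholder for the loop above):

#include <stdio.h>
#include <papi.h>

/* Placeholder for the forward-projection loop shown above. */
extern void forward_project(void);

void measure_l1d_misses(void)
{
    int events[1] = { PAPI_L1_DCM };   /* preset event: L1 data cache misses */
    long long counts[1];

    if (PAPI_start_counters(events, 1) != PAPI_OK) {
        fprintf(stderr, "could not start PAPI counters\n");
        return;
    }

    forward_project();                 /* the hot loop */

    if (PAPI_stop_counters(counts, 1) != PAPI_OK) {
        fprintf(stderr, "could not stop PAPI counters\n");
        return;
    }

    printf("L1D cache misses: %lld\n", counts[0]);
}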
Searching for a way to fix this, I came across Ulrich Drepper's "What Every Programmer Should Know About Memory" (www.akkadia.org/drepper/cpumemory.pdf), in particular section A.1.
My code is essentially an SpMV (sparse matrix-vector product).
The problem is that, because of the symmetries, the accesses to the b vector are not sequential at all.
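To be clear about what the loop is: without the symmetries it reduces to a plain CRS SpMV. A minimal sketch of that baseline, reusing the roles of my arrays (a2r = row offsets, a2b = column/pixel indices, a2p = values); the function name and signature are just for illustration:

/* Plain CRS sparse matrix-vector product, b += A*x. */
void spmv_crs(int n_rows, const int *a2r, const int *a2b,
              const float *a2p, const float *x, float *b)
{
    for (int row = 0; row < n_rows; row++) {
        float sum = 0.0f;                     /* row result stays in a register */
        for (int s = a2r[row]; s < a2r[row + 1]; s++)
            sum += a2p[s] * x[a2b[s]];
        b[row] += sum;
    }
}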
Is there a way to write to b using SIMD?
For example, could void _mm_stream_ps(float *p, __m128 a) be used to write the float values to b without going through the cache?
The problem I see is that _mm_stream_ps writes 4 floats at a time, while the accesses to b are not contiguous.
Or is there a better way to do this?
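To make the issue concrete, this is my understanding of how a non-temporal store is normally used: it needs a 16-byte-aligned destination and writes 4 consecutive floats at once, which is exactly what my scattered single-float updates to b do not provide (illustrative sketch, not my real code):

#include <xmmintrin.h>   /* SSE: _mm_set1_ps, _mm_stream_ps, _mm_sfence */

/* Fills a buffer without pulling its cache lines into the cache.
   dst must be 16-byte aligned and n a multiple of 4. */
void stream_fill(float *dst, float value, int n)
{
    __m128 v = _mm_set1_ps(value);
    for (int i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);   /* 4 consecutive floats, bypassing the cache */
    _mm_sfence();                    /* order the streaming stores */
}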
EXTRA INFO: v is a value of the sparse matrix, which is stored in CRS format. With CRS the accesses to the matrix values and to the x vector are basically sequential; it is only the accesses to b that are random. The loop produces about 400.000.000 L1D misses, compared with roughly 100.000.000 misses without the accesses to b, so b is clearly the problem.
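For completeness, this is roughly how I picture the data layout used in the loop (the array roles come from the code above; the typedef names and the integer types are just my shorthand here):

/* Rough sketch of the data structures referenced in the loop. */
typedef struct {
    int x, y, xy, z;            /* indices of the LORs symmetric to this one */
} symm_lor_t;

typedef struct {
    symm_lor_t *symmLOR;        /* per-LOR symmetry table */
    int   *a2r;                 /* CRS row offsets, indexed by LOR */
    int   *a2b;                 /* CRS column indices (pixels) */
    float *a2p;                 /* CRS nonzero values (v) */
    int   *symm_Xpixel;         /* pixel mirrored in X */
    int   *symm_Ypixel;         /* pixel mirrored in Y */
    int   *symm_XYpixel;        /* pixel mirrored in X and Y */
    int   *symm_Zpixel;         /* pixel mirrored in Z */
} problem_t;                    /* the struct I call P in the loop */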
Thanks in advance.