My SIMD implementation takes a varying amount of time, even though it always runs on the same fixed input. The runtime ranges from 100 million to 120 million clock cycles. The program calls one function about 600 times, and the most expensive part of that function is that it accesses memory about 2000 times. So memory is heavily involved in my program overall.
Is the variation in runtime related to the memory access pattern or to the initial contents of memory?
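One way to see whether the spread is systematic or just noise is to time the same work several times and compare the fastest and slowest runs. The following is only a minimal sketch, assuming GCC or Clang on x86. The names `cycle_stats`, `measure_cycles` and `dummy_work` are hypothetical and not part of my program. Note that `__rdtsc()` reads the time-stamp counter, which on most recent x86 processors ticks at a constant rate and so may not equal core clock cycles.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc (GCC/Clang, x86 only) */

/* Hypothetical container for the observed spread, in TSC ticks. */
struct cycle_stats {
    uint64_t min;
    uint64_t max;
};

/* Hypothetical stand-in for the real function under test. */
static volatile uint64_t sink;
static void dummy_work(void)
{
    for (int i = 0; i < 1000; i++)
        sink += (uint64_t)i;
}

/* Run `work` `runs` times and record the fastest and slowest run. */
static struct cycle_stats measure_cycles(void (*work)(void), int runs)
{
    struct cycle_stats s = { UINT64_MAX, 0 };
    for (int r = 0; r < runs; r++) {
        uint64_t start = __rdtsc();
        work();
        uint64_t elapsed = __rdtsc() - start;
        if (elapsed < s.min) s.min = elapsed;
        if (elapsed > s.max) s.max = elapsed;
    }
    return s;
}
```

Calling `measure_cycles(dummy_work, 100)` and comparing `min` with `max` gives a rough idea of the spread. If the first run is consistently the slowest, cold caches are a likely suspect, but that would still need to be confirmed with a profiler.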
I used valgrind to profile my program. It shows that each memory access takes about 8 instructions. Is this normal?
Below is the code snippet (a function) that is called 600 times. mulprev[32][20] is the array that is accessed most often.
    j = 15;
    u3v = _mm_set_epi64x (0xF, 0xF);    /* 4-bit mask for each 64-bit lane */
    while (j + 1)
    {
        l = j << 2;    /* bit offset of nibble j */
        for (i = 0; i < 20; i++)
        {
            /* extract nibble j from both 64-bit lanes of elm1v[i] */
            val1v = _mm_load_si128 ((__m128i *) &elm1v[i]);
            uv = _mm_and_si128 (_mm_srli_epi64 (val1v, l), u3v);
            u1 = _mm_extract_epi16 (uv, 0);         /* low lane selects rows 0..15 */
            u2 = _mm_extract_epi16 (uv, 4) + 16;    /* high lane selects rows 16..31 */
            for (ival = i, ival1 = i + 1, k = 0; k < 20; k += 2, ival += 2, ival1 += 2)
            {
                /* XOR 128 bits from each selected mulprev row into res[ival] and res[ival1] */
                temp11v = _mm_load_si128 ((__m128i *) &mulprev[u1][k]);
                temp12v = _mm_load_si128 ((__m128i *) &mulprev[u2][k]);
                val1v = _mm_load_si128 ((__m128i *) &res[ival]);
                val2v = _mm_load_si128 ((__m128i *) &res[ival1]);
                bv = _mm_xor_si128 (val1v, _mm_unpacklo_epi64 (temp11v, temp12v));
                av = _mm_xor_si128 (val2v, _mm_unpackhi_epi64 (temp11v, temp12v));
                _mm_store_si128 ((__m128i *) &res[ival], bv);
                _mm_store_si128 ((__m128i *) &res[ival1], av);
            }
        }
        if (j == 0)
            break;
        /* shift each 64-bit lane of res left by 4 bits, carrying the top nibble into the next element */
        val0v = _mm_setzero_si128 ();
        for (i = 0; i < 40; i++)
        {
            testv = _mm_load_si128 ((__m128i *) &res[i]);
            val1v = _mm_srli_epi64 (testv, 60);
            val2v = _mm_xor_si128 (val0v, _mm_slli_epi64 (testv, 4));
            _mm_store_si128 (&res[i], val2v);
            val0v = val1v;
        }
        j--;
    }
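
To make the table indexing in the snippet easier to follow, here is a small standalone sketch of the nibble extraction that produces u1 and u2. It assumes each vector holds two 64-bit values. The helper names `nibble_scalar` and `nibbles_sse2` are hypothetical. It uses `_mm_srl_epi64`, which takes the shift count in a register, instead of the `_mm_srli_epi64` call in the snippet.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: nibble j (0 = least significant) of x. */
static unsigned nibble_scalar(uint64_t x, int j)
{
    return (unsigned)((x >> (4 * j)) & 0xF);
}

/* Hypothetical helper: extract nibble j from two 64-bit values at once,
 * one per lane of a 128-bit register. */
static void nibbles_sse2(uint64_t lo, uint64_t hi, int j,
                         unsigned *u_lo, unsigned *u_hi)
{
    __m128i v     = _mm_set_epi64x((long long)hi, (long long)lo);
    __m128i mask  = _mm_set_epi64x(0xF, 0xF);
    __m128i count = _mm_cvtsi32_si128(4 * j);   /* shift count in a register */
    __m128i uv    = _mm_and_si128(_mm_srl_epi64(v, count), mask);

    /* After masking, each lane holds a value 0..15 in its lowest 16-bit word:
     * word 0 belongs to the low lane, word 4 to the high lane. */
    *u_lo = (unsigned)_mm_extract_epi16(uv, 0);
    *u_hi = (unsigned)_mm_extract_epi16(uv, 4);
}
```

In the snippet, the low-lane nibble is used directly as a row index into mulprev, and the high-lane nibble has 16 added to it, so the two lanes read from disjoint halves of the 32 rows.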