Why does my code cause command cache misses?

Question

Why does my code cause command cache misses?

According to cachegrind, this checksum calculation procedure is one of the biggest contributions to loading the command cache and skipping the command cache throughout the application:

#include <stdint.h> namespace { uint32_t OnesComplementSum(const uint16_t * b16, int len) { uint32_t sum = 0; uint32_t a = 0; uint32_t b = 0; uint32_t c = 0; uint32_t d = 0; // helper for the loop unrolling auto run8 = [&] { a += b16[0]; b += b16[1]; c += b16[2]; d += b16[3]; b16 += 4; }; for (;;) { if (len > 32) { run8(); run8(); run8(); run8(); len -= 32; continue; } if (len > 8) { run8(); len -= 8; continue; } break; } sum += (a + b) + (c + d); auto reduce = [&]() { sum = (sum & 0xFFFF) + (sum >> 16); if (sum > 0xFFFF) sum -= 0xFFFF; }; reduce(); while ((len -= 2) >= 0) sum += *b16++; if (len == -1) sum += *(const uint8_t *)b16; // add the last byte reduce(); return sum; } } // anonymous namespace uint32_t get(const uint16_t* data, int length) { return OnesComplementSum(data, length); }

See here asm output.

This may be caused by the loop unrolling, but the generated object code does not seem too excessive.

How can I improve the code?

Update

Because the checksum function was in an anonymous namespace, it was built-in and duplicated by two functions that were in the same cpp file.
Deploying a loop is still useful. Removing the code slowed down the code.
Improving the infinite loop speeds up the code (but for some reason I get the opposite results on my mac).
- Before corrections: here you can see two checksums and 17,210 passes L1 IR
- After corrections: after fixing the problem of embedding and fixing an infinite loop, the cache of the L1 instruction cache fell to 8324.
- "InstructionFetch" above in a fixed example. I am not sure how to interpret this. Does this mean that when most of the activity occurs? Or does it hint at a problem?

+5

c ++ performance cpu-cache

Stackedcrooked Apr 4 '15 at 16:18

source share

1 answer

Glenn teitelbaum · Answer 1 · 2015-07-29T23:19:44+0000

replace the main loop as follows:

 const int quick_len=len/8; const uint16_t * const the_end=b16+quick_len*4; len -= quick_len*8; for (; b16+4 <= the_end; b16+=4) { a += b16[0]; b += b16[1]; c += b16[2]; d += b16[3]; }

It seems there is no need to manually unroll the loop if you use -O3

In addition, the test case allowed for too much optimization, since the input was static and the results were not used, as well as printing the result helps to verify that the optimized versions do not violate anything

The full test that I used:

 int main(int argc, char *argv[]) { using namespace std::chrono; auto start_time = steady_clock::now(); int ret=OnesComplementSum((const uint8_t*)(s.data()+argc), s.size()-argc, 0); auto elapsed_ns = duration_cast<nanoseconds>(steady_clock::now() - start_time).count(); std::cout << "loop=" << loop << " elapsed_ns=" << elapsed_ns << " = " << ret<< std::endl; return ret; }

Comparison with theis ( CLEAN LOOP ) and an improved version ( UGLY LOOP ) and a longer test line:

 loop=CLEAN_LOOP elapsed_ns=8365 = 14031 loop=CLEAN_LOOP elapsed_ns=5793 = 14031 loop=CLEAN_LOOP elapsed_ns=5623 = 14031 loop=CLEAN_LOOP elapsed_ns=5585 = 14031 loop=UGLY_LOOP elapsed_ns=9365 = 14031 loop=UGLY_LOOP elapsed_ns=8957 = 14031 loop=UGLY_LOOP elapsed_ns=8877 = 14031 loop=UGLY_LOOP elapsed_ns=8873 = 14031

Check here: http://coliru.stacked-crooked.com/a/52d670039de17943

EDIT:

In fact, the entire function can be reduced to:

 uint32_t OnesComplementSum(const uint8_t* inData, int len, uint32_t sum) { const uint16_t * b16 = reinterpret_cast<const uint16_t *>(inData); const uint16_t * const the_end=b16+len/2; for (; b16 < the_end; ++b16) { sum += *b16; } sum = (sum & uint16_t(-1)) + (sum >> 16); return (sum > uint16_t(-1)) ? sum - uint16_t(-1) : sum; }

Which is better than OP with -O3, but worse with -O2:

http://coliru.stacked-crooked.com/a/bcca1e94c2f394c7

 loop=CLEAN_LOOP elapsed_ns=5825 = 14031 loop=CLEAN_LOOP elapsed_ns=5717 = 14031 loop=CLEAN_LOOP elapsed_ns=5681 = 14031 loop=CLEAN_LOOP elapsed_ns=5646 = 14031 loop=UGLY_LOOP elapsed_ns=9201 = 14031 loop=UGLY_LOOP elapsed_ns=8826 = 14031 loop=UGLY_LOOP elapsed_ns=8859 = 14031 loop=UGLY_LOOP elapsed_ns=9582 = 14031

Thus, the mileage may vary, and if the architecture is not known exactly, I would just become simpler

Why does my code cause command cache misses?

Update

More articles: