Replace the main loop as follows:
    const int quick_len = len / 8;
    const uint16_t * const the_end = b16 + quick_len * 4;
    len -= quick_len * 8;
    for (; b16 + 4 <= the_end; b16 += 4)
    {
        a += b16[0];
        b += b16[1];
        c += b16[2];
        d += b16[3];
    }
It seems there is no need to unroll the loop manually if you compile with -O3.
In addition, the test case allowed for too much optimization, since the input was static and the results were not used. Printing the result also helps to verify that the optimized versions do not break anything.
The full test that I used:
    // s (the input string) and loop (the name of the tested variant)
    // are defined elsewhere in the full test.
    int main(int argc, char *argv[])
    {
        using namespace std::chrono;
        auto start_time = steady_clock::now();
        // Offsetting by argc makes the input depend on a run-time value.
        int ret = OnesComplementSum((const uint8_t*)(s.data() + argc), s.size() - argc, 0);
        auto elapsed_ns = duration_cast<nanoseconds>(steady_clock::now() - start_time).count();
        std::cout << "loop=" << loop << " elapsed_ns=" << elapsed_ns << " = " << ret << std::endl;
        return ret;
    }
Comparison of this version (CLEAN_LOOP) with an improved version (UGLY_LOOP), using a longer test string:

    loop=CLEAN_LOOP elapsed_ns=8365 = 14031
    loop=CLEAN_LOOP elapsed_ns=5793 = 14031
    loop=CLEAN_LOOP elapsed_ns=5623 = 14031
    loop=CLEAN_LOOP elapsed_ns=5585 = 14031
    loop=UGLY_LOOP elapsed_ns=9365 = 14031
    loop=UGLY_LOOP elapsed_ns=8957 = 14031
    loop=UGLY_LOOP elapsed_ns=8877 = 14031
    loop=UGLY_LOOP elapsed_ns=8873 = 14031
Check here: http://coliru.stacked-crooked.com/a/52d670039de17943
EDIT:
In fact, the entire function can be reduced to:
    uint32_t OnesComplementSum(const uint8_t* inData, int len, uint32_t sum)
    {
        const uint16_t * b16 = reinterpret_cast<const uint16_t *>(inData);
        const uint16_t * const the_end = b16 + len / 2;
        for (; b16 < the_end; ++b16)
        {
            sum += *b16;
        }
        sum = (sum & uint16_t(-1)) + (sum >> 16);
        return (sum > uint16_t(-1)) ? sum - uint16_t(-1) : sum;
    }
This is faster than the OP's version with -O3, but slower with -O2:
http://coliru.stacked-crooked.com/a/bcca1e94c2f394c7
    loop=CLEAN_LOOP elapsed_ns=5825 = 14031
    loop=CLEAN_LOOP elapsed_ns=5717 = 14031
    loop=CLEAN_LOOP elapsed_ns=5681 = 14031
    loop=CLEAN_LOOP elapsed_ns=5646 = 14031
    loop=UGLY_LOOP elapsed_ns=9201 = 14031
    loop=UGLY_LOOP elapsed_ns=8826 = 14031
    loop=UGLY_LOOP elapsed_ns=8859 = 14031
    loop=UGLY_LOOP elapsed_ns=9582 = 14031
So your mileage may vary, and if the target architecture is not known exactly, I would just go with the simpler version.
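Whichever variant you pick, it is worth checking that it still returns the same value as a straightforward reference. Below is a minimal sketch of such a check. `ReferenceSum` and `SumsAgree` are hypothetical helpers introduced here, not part of the original test. The reference reads each word with `std::memcpy`, so it does not depend on the buffer being aligned for `uint16_t`.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// The reduced function from the EDIT, reproduced so the sketch is self-contained.
uint32_t OnesComplementSum(const uint8_t* inData, int len, uint32_t sum)
{
    const uint16_t* b16 = reinterpret_cast<const uint16_t*>(inData);
    const uint16_t* const the_end = b16 + len / 2;
    for (; b16 < the_end; ++b16)
    {
        sum += *b16;
    }
    sum = (sum & uint16_t(-1)) + (sum >> 16);
    return (sum > uint16_t(-1)) ? sum - uint16_t(-1) : sum;
}

// Hypothetical reference: same arithmetic, but each 16-bit word is read
// with memcpy. Like the function above, it ignores an odd trailing byte.
uint32_t ReferenceSum(const uint8_t* inData, int len, uint32_t sum)
{
    for (int i = 0; i + 1 < len; i += 2)
    {
        uint16_t word;
        std::memcpy(&word, inData + i, sizeof word);
        sum += word;
    }
    sum = (sum & 0xFFFFu) + (sum >> 16);
    return (sum > 0xFFFFu) ? sum - 0xFFFFu : sum;
}

// Hypothetical helper: true when both implementations agree on the buffer.
bool SumsAgree(const std::vector<uint16_t>& words)
{
    const uint8_t* bytes = reinterpret_cast<const uint8_t*>(words.data());
    const int len = static_cast<int>(words.size() * sizeof(uint16_t));
    return OnesComplementSum(bytes, len, 0) == ReferenceSum(bytes, len, 0);
}
```

Running `SumsAgree` over a few buffers, including one that produces a carry such as `{0xFFFF, 0x0001}`, before timing anything gives some confidence that a faster variant has not changed the result.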