Here is a monotonic counter. I'm not sure if you can call it simple, though.
Assuming both ONE and ZERO are always kept in registers, this should compile to 5 instructions (7 or 8 if VEX encoding is not used).
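    #include <smmintrin.h>  // SSE4.1, needed for _mm_cmpeq_epi64

    inline __m128i nextc(__m128i x){
        const __m128i ONE = _mm_setr_epi32(1,0,0,0);
        const __m128i ZERO = _mm_setzero_si128();

        x = _mm_add_epi64(x,ONE);              // increment the low 64-bit lane
        __m128i t = _mm_cmpeq_epi64(x,ZERO);   // all-ones in each lane that equals 0
        t = _mm_and_si128(t,ONE);              // low lane: 1 if it wrapped, else 0
        t = _mm_unpacklo_epi64(ZERO,t);        // move that carry into the high lane
        x = _mm_add_epi64(x,t);                // add the carry to the high lane

        return x;
    }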
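To make the carry logic easier to follow, here is a rough scalar sketch of the same steps, one 64-bit lane at a time. The names `Lanes`, `nextc_model` and `nextc_model_check` are made up for illustration and are not part of the intrinsic version; the sketch only models the arithmetic and says nothing about the instruction count.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar stand-in for the two 64-bit lanes of an __m128i.
struct Lanes {
    std::uint64_t lo;  // lane 0
    std::uint64_t hi;  // lane 1
};

// Scalar model of nextc(): the same steps, written out per lane.
inline Lanes nextc_model(Lanes x) {
    // _mm_add_epi64(x, ONE): ONE is {1, 0}, so only the low lane changes.
    x.lo += 1;

    // _mm_cmpeq_epi64(x, ZERO): a lane becomes all-ones if it equals 0.
    Lanes t;
    t.lo = (x.lo == 0) ? ~std::uint64_t{0} : 0;
    t.hi = (x.hi == 0) ? ~std::uint64_t{0} : 0;

    // _mm_and_si128(t, ONE): keeps bit 0 of the low lane, clears the high lane.
    t.lo &= 1;
    t.hi &= 0;

    // _mm_unpacklo_epi64(ZERO, t): result is {ZERO.lo, t.lo}.
    t.hi = t.lo;
    t.lo = 0;

    // _mm_add_epi64(x, t): the carry (0 or 1) lands in the high lane.
    x.lo += t.lo;
    x.hi += t.hi;
    return x;
}

// Small self-check: one wrap-around case and one ordinary increment.
inline void nextc_model_check() {
    Lanes r = nextc_model(Lanes{0xFFFFFFFFFFFFFFFFull, 1});
    assert(r.lo == 0 && r.hi == 2);
    r = nextc_model(Lanes{5, 7});
    assert(r.lo == 6 && r.hi == 7);
}
```

Note that the comparison also flags a high lane that happens to be 0, but the AND with ONE discards that bit, so only a wrap of the low lane produces a carry.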
Test Code (MSVC):
    #include <iostream>
    using namespace std;

    int main() {
        __m128i x = _mm_setr_epi32(0xfffffffa,0xffffffff,1,0);

        int c = 0;
        while (c++ < 10){
            // m128i_u64 is an MSVC-specific member of __m128i
            cout << x.m128i_u64[0] << "  " << x.m128i_u64[1] << endl;
            x = nextc(x);
        }

        return 0;
    }
Output:
    18446744073709551610  1
    18446744073709551611  1
    18446744073709551612  1
    18446744073709551613  1
    18446744073709551614  1
    18446744073709551615  1
    0  2
    1  2
    2  2
    3  2
Here is a slightly better version, suggested by @Norbert P. It saves 1 instruction compared to my original solution.
    inline __m128i nextc(__m128i x){
        const __m128i ONE = _mm_setr_epi32(1,0,0,0);
        const __m128i ZERO = _mm_setzero_si128();

        x = _mm_add_epi64(x,ONE);
        __m128i t = _mm_cmpeq_epi64(x,ZERO);
        t = _mm_unpacklo_epi64(ZERO,t);
        x = _mm_sub_epi64(x,t);

        return x;
    }
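The saved instruction comes from the fact that an all-ones comparison mask is -1 in two's complement, so subtracting the mask has the same effect as masking it down to 1 and adding. A small scalar sketch of that identity for the high lane follows; the function names are hypothetical, and `mask` stands for one lane of a `_mm_cmpeq_epi64` result (0 or all-ones).

```cpp
#include <cassert>
#include <cstdint>

// First version: AND the mask with 1, then add the result.
inline std::uint64_t carry_by_and_add(std::uint64_t hi, std::uint64_t mask) {
    return hi + (mask & 1);
}

// Second version: subtract the mask directly (all-ones acts as -1 mod 2^64).
inline std::uint64_t carry_by_sub(std::uint64_t hi, std::uint64_t mask) {
    return hi - mask;
}

// Checks that both forms agree for the only two mask values that can occur.
inline void carry_check() {
    const std::uint64_t NO_CARRY = 0;
    const std::uint64_t CARRY = ~std::uint64_t{0};
    const std::uint64_t samples[] = {0, 1, 41, 0xFFFFFFFFFFFFFFFFull};
    for (std::uint64_t hi : samples) {
        assert(carry_by_and_add(hi, NO_CARRY) == carry_by_sub(hi, NO_CARRY));
        assert(carry_by_and_add(hi, CARRY) == carry_by_sub(hi, CARRY));
    }
}
```

In the intrinsic version the low lane of `t` is zero after the unpack, so the subtraction leaves the low lane of `x` unchanged.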
Mysticial