The main observation is that some swaps can be executed in parallel. To quote your post:
- we first swap every two bits and perform 4 swaps,
- then we replace 2-bit pairs, so we have 2 swaps
- and then we combine everything together in the last swap.
Total: 7 swaps, log (N) - 3.
That's right, it has 7 swaps, but in 3 steps described above.
For example, replacing 4 pairs looks like x = ((x & 01010101b) << 1) | ((x >> 1) & 01010101b) x = ((x & 01010101b) << 1) | ((x >> 1) & 01010101b) . Thus, we take bits 0, 2, 4 and 6 and advance them at positions 1, 3, 5 and 7 (the left half) and at the same time take bits 1, 3, 5 and 7 and lower them to positions 0, 2, 4 and 6, respectively (right half).
source share