Why speculate? We can try it and find out. I compiled the code with gcc -O3 -g (on x86) and disassembled the result. There were more changes than I expected, so I will focus on the part in the middle, where we expect most of the differences between the two to be. The core of the loop in the first case:
    0x00000030 <foo+48>:  mov    %dl,(%edi,%esi,1)
    0x00000033 <foo+51>:  movzbl 0x1(%ecx),%edx
    0x00000037 <foo+55>:  inc    %eax
    0x00000038 <foo+56>:  inc    %ecx
    0x00000039 <foo+57>:  mov    %eax,%esi
    0x0000003b <foo+59>:  test   %dl,%dl
    0x0000003d <foo+61>:  jne    0x30 <foo+48>
The core of the loop in the second case:
    0x00000080 <foo2+48>:  mov    %dl,(%eax)
    0x00000082 <foo2+50>:  movzbl 0x1(%ecx),%edx
    0x00000086 <foo2+54>:  inc    %eax
    0x00000087 <foo2+55>:  inc    %ecx
    0x00000088 <foo2+56>:  test   %dl,%dl
    0x0000008a <foo2+58>:  jne    0x80 <foo2+48>
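The source of foo and foo2 is not shown here, so the following is only a guess at the two loop shapes that could compile to listings like these: one that stores through a base pointer plus an index, and one that stores through a moving pointer. The names copy_indexed and copy_pointer are hypothetical stand-ins, not the original functions.

```c
#include <stddef.h>

/* Hypothetical stand-in for foo: store via base pointer plus index.
   The separate index is what would need the (%edi,%esi,1) addressing mode. */
char *copy_indexed(char *dst, const char *src)
{
    size_t i = 0;
    while (*src != '\0') {
        dst[i] = *src;   /* indexed store */
        i++;
        src++;
    }
    dst[i] = '\0';
    return dst;
}

/* Hypothetical stand-in for foo2: store via a moving destination pointer. */
char *copy_pointer(char *dst, const char *src)
{
    char *start = dst;
    while (*src != '\0') {
        *dst++ = *src++; /* plain store through the pointer, then advance both */
    }
    *dst = '\0';
    return start;
}
```

Both functions produce the same output for the same input. Whether a given compiler emits exactly the instructions shown above for them is an assumption; it depends on the compiler version and flags.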
On this basis, the second is perhaps a little faster: its loop has one fewer instruction per iteration and a simpler store. In practice, though, it will make little difference. Both loops fit comfortably in the L1 cache, and the target memory is uncached, so the difference is moot. Good luck ever actually measuring a difference between the two.
Donal Fellows