This is probably down to the behavior of the processor cache: at 12 MB, your images far exceed the 256 KB L2 cache of the ARM Cortex-A8 inside the iPhone 3GS.
The first example accesses the read array in sequential order, which is fast, but it has to access the write array out of order, which is slow.
The second example is the opposite: the write array is written in fast, sequential order, and the read array is accessed more slowly. Write misses are evidently less expensive than read misses under this workload.
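To see the two access patterns side by side, here is a minimal C sketch. The names `reorder_read_seq`, `reorder_write_seq` and `time_variant` are hypothetical, since the question's original loops are not reproduced here. The first variant assumes the same index mapping as the `reorder()` function in this answer; the second computes the same result with the loops swapped, so that writes are sequential and reads are strided.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical variant 1: src is read sequentially, dst is written with a
 * stride of `width` elements (same mapping as reorder() in this answer). */
void reorder_read_seq(uint32_t *dst, const uint32_t *src,
                      size_t width, size_t height)
{
    size_t i = 0;
    for (size_t x = 0; x < width; x++)
        for (size_t y = 0; y < height; y++)
            dst[x + y * width] = src[i++];
}

/* Hypothetical variant 2: same result, but dst is written sequentially and
 * src is read with a stride of `height` elements. */
void reorder_write_seq(uint32_t *dst, const uint32_t *src,
                       size_t width, size_t height)
{
    size_t j = 0;
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            dst[j++] = src[x * height + y];
}

/* Hypothetical helper: CPU time in seconds for one call of a variant. */
double time_variant(void (*fn)(uint32_t *, const uint32_t *, size_t, size_t),
                    uint32_t *dst, const uint32_t *src,
                    size_t width, size_t height)
{
    clock_t start = clock();
    fn(dst, src, width, height);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

With buffers well beyond the L2 size (the 12 MB images in the question, for instance), comparing `time_variant(reorder_read_seq, ...)` against `time_variant(reorder_write_seq, ...)` should let you check the effect on your own device; the size of the gap depends on the hardware, so no particular numbers are implied here.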
Ulrich Drepper's article "What Every Programmer Should Know About Memory" is recommended reading if you want to learn more about this.
Note that if you put this operation in a function, you can help the optimizer generate more efficient code by using the restrict qualifier on the pointer arguments, for example:
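    #include <stdint.h>

    void reorder(uint32_t *restrict buffer1, const uint32_t *restrict buffer2,
                 int width, int height)
    {
        int i = 0;
        /* buffer2 is read sequentially; buffer1 is written with a stride of width */
        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
                buffer1[x + y * width] = buffer2[i++];
    }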
(The restrict qualifier promises the compiler that the data pointed to by the two pointers does not overlap, which in this case is required for the function to make sense anyway.)