The answer depends a bit on how the matrix is defined. For a fully dynamically allocated array you will have:
    T **matrix;
    matrix = new T*[n];
    for (int i = 0; i < n; i++) {
        matrix[i] = new T[m];
    }
So each matrix[j] requires a fresh memory lookup to fetch the row pointer. If you put the j loop on the outside, the inner loop can reuse the pointer matrix[j] for the entire inner loop.
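As a minimal sketch of the two traversal orders (the element type double, the function names, and the summing workload are my assumptions for illustration, not part of the original answer):

    // n rows of m elements each, allocated row by row as in the snippet above.

    // Fast order: j (the row index) runs in the outer loop, so the row
    // pointer matrix[j] is loaded once and reused for the whole row.
    double sum_rows_outer(double **matrix, int n, int m) {
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
            double *row = matrix[j];   // one pointer lookup per row
            for (int i = 0; i < m; i++)
                sum += row[i];         // consecutive addresses within the row
        }
        return sum;
    }

    // Slow order: j runs in the inner loop, so every iteration needs a
    // fresh lookup of matrix[j] and lands in a different row of memory.
    double sum_rows_inner(double **matrix, int n, int m) {
        double sum = 0.0;
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                sum += matrix[j][i];
        return sum;
    }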
If the matrix is a simple two-dimensional array:
    T matrix[n][m];
then matrix[j] is just a multiplication by 1024 * sizeof(T). In optimized code this can be strength-reduced to adding 1024 * sizeof(T) to the address on each step of the loop index, so it should be relatively fast either way.
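A sketch of what that optimization amounts to (the concrete 1024x1024 sizes and the double element type are assumptions for illustration):

    enum { N = 1024, M = 1024 };
    double matrix[N][M];

    // matrix[j][i] is pure address arithmetic: base + (j * M + i) * sizeof(double).
    // The multiply by M * sizeof(double) can be strength-reduced to one addition
    // per outer iteration, so the static array needs no pointer lookups at all.
    double sum_all() {
        double sum = 0.0;
        const double *row = &matrix[0][0];
        for (int j = 0; j < N; j++) {
            for (int i = 0; i < M; i++)
                sum += row[i];   // advances by sizeof(double) each step
            row += M;            // add M elements instead of recomputing j * M
        }
        return sum;
    }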
In addition, there are cache locality factors. Caches hold data in "lines", typically 32 to 128 bytes per line. So when your code reads address X, the cache loads the 32 to 128 bytes around X. Therefore, if the NEXT item you need is only sizeof(T) past the current location, it is most likely already in the cache [and modern processors also detect when a loop strides through memory reading each cell in turn, and prefetch the data ahead of you].
In the case where j is the inner loop, each iteration reads a new location sizeof(T) * 1024 bytes away [or possibly an even larger, irregular distance if the matrix is dynamically allocated]. That means the data loaded into the cache is of no use on the next iteration, because the next access is not within the neighbouring 32-128 bytes.
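A minimal timing sketch of this effect (not from the original answer; the 1024x1024 size matches the dimensions above, and the absolute numbers will vary by machine and compiler flags):

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1024, m = 1024;
        std::vector<double> a((size_t)n * m, 1.0);   // one contiguous block
        volatile double sink;                        // keeps the sums from being optimized away
        double s = 0.0;

        auto t0 = std::chrono::steady_clock::now();
        for (int j = 0; j < n; j++)                  // row-major: stride sizeof(double)
            for (int i = 0; i < m; i++)
                s += a[(size_t)j * m + i];
        auto t1 = std::chrono::steady_clock::now();
        sink = s;

        s = 0.0;
        for (int i = 0; i < m; i++)                  // column-major: stride m * sizeof(double)
            for (int j = 0; j < n; j++)
                s += a[(size_t)j * m + i];
        auto t2 = std::chrono::steady_clock::now();
        sink = s;

        std::printf("row-major:    %lld us\n", (long long)
            std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
        std::printf("column-major: %lld us\n", (long long)
            std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
        return 0;
    }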
And finally, the first loop may be better optimized through SSE or similar instructions, which let the computation run even faster. But this is probably marginal for such a large matrix, since at that size performance is severely limited by memory bandwidth.
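If you want to check whether the compiler actually vectorized the contiguous inner loop, both major compilers can report it (these flags are standard, though the report format varies by version):

    g++ -O3 -march=native -fopt-info-vec program.cpp          # GCC vectorization report
    clang++ -O3 -march=native -Rpass=loop-vectorize program.cpp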
Mats Petersson, Oct 27 '14 at 8:55