Why is there such a significant difference in runtime between these C++ loops?

I was benchmarking some loops and found a significant difference in runtime depending on how the array is accessed. I can't understand what causes the difference between the two cases.

First example:

Running time: 8 seconds

for (int kk = 0; kk < 1000; kk++) {
    sum = 0;
    for (int i = 0; i < 1024; i++)
        for (int j = 0; j < 1024; j++) {
            sum += matrix[i][j];
        }
}

Second example:

Running time: 23 seconds

for (int kk = 0; kk < 1000; kk++) {
    sum = 0;
    for (int i = 0; i < 1024; i++)
        for (int j = 0; j < 1024; j++) {
            sum += matrix[j][i];
        }
}

What causes such a large difference in runtime merely from changing

 matrix[i][j] 

to

 matrix[j][i] 

?

+54
c++ performance nested-loops
Oct 27 '14 at 8:28
8 answers

This is a memory cache problem.

matrix[i][j] has a better cache hit rate than matrix[j][i], because matrix[i][j] offers more chances of contiguous memory access.

For example, when we access matrix[i][0], the cache may load a contiguous segment of memory containing matrix[i][0], so accessing matrix[i][1], matrix[i][2], ... benefits from cache speed, since matrix[i][1], matrix[i][2], ... are adjacent to matrix[i][0].

However, when we access matrix[j][0], it is far from matrix[j - 1][0], so it may not be in the cache and cannot benefit from cache speed. In particular, a matrix is normally stored as one large contiguous segment of memory, so the caching hardware can predict sequential memory-access behaviour and keep the needed memory cached.

This is why matrix[i][j] is faster. This is typical for cache-based performance optimization.
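For anyone who wants to reproduce this, below is a minimal benchmark sketch of my own (not the questioner's program); the matrix contents, iteration count and timing code are assumptions, and it should be built without aggressive optimization, or the compiler may interchange or remove the loops.

#include <chrono>
#include <cstdio>

static int matrix[1024][1024];   // row-major, as built-in 2D arrays always are

int main() {
    long long sum = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (int kk = 0; kk < 100; kk++)
        for (int i = 0; i < 1024; i++)
            for (int j = 0; j < 1024; j++)
                sum += matrix[i][j];             // row-wise: sequential addresses

    auto t1 = std::chrono::steady_clock::now();
    for (int kk = 0; kk < 100; kk++)
        for (int i = 0; i < 1024; i++)
            for (int j = 0; j < 1024; j++)
                sum += matrix[j][i];             // column-wise: 4 KB stride between accesses

    auto t2 = std::chrono::steady_clock::now();
    using ms = std::chrono::milliseconds;
    std::printf("row-wise:    %lld ms\n", (long long)std::chrono::duration_cast<ms>(t1 - t0).count());
    std::printf("column-wise: %lld ms\n", (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
    return (int)sum;   // keep sum live so the loops are not eliminated
}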

+108
Oct 27 '14 at 8:33

The difference in performance is due to the caching strategy of the computer.

The two-dimensional array matrix[i][j] is laid out in memory as one long list of values.

For example, array A[3][4] looks like this:

 1 1 1 1 2 2 2 2 3 3 3 3 

In this example, every entry A[0][x] is set to 1, every entry A[1][x] is set to 2, ...

If your first loop iterates over this matrix, the access order is this:

 1 2 3 4 5 6 7 8 9 10 11 12 

Whereas the access order of the second loop looks like this:

 1 4 7 10 2 5 8 11 3 6 9 12 

When a program accesses one element of an array, it also loads subsequent elements into the cache.

E.g. if you access A[0][1], then A[0][2] and A[0][3] are loaded as well.

Thus the first loop has to perform fewer load operations, because some elements are already in the cache when they are needed. The second loop loads entries into the cache that are not needed at that moment, resulting in more load operations.
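As a small illustration of my own (not part of the answer), printing a few addresses makes this layout visible; the 3x4 shape mirrors the A[3][4] example above:

#include <cstdio>

int main() {
    int A[3][4] = {};

    std::printf("A[0][0] at %p\n", (void*)&A[0][0]);
    std::printf("A[0][1] at %p   <- sizeof(int) bytes further on (same cache line)\n", (void*)&A[0][1]);
    std::printf("A[1][0] at %p   <- a whole row (4 * sizeof(int) bytes) further on\n", (void*)&A[1][0]);
    return 0;
}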

+53
Oct 27 '14 at 8:58

Other people have done a good job of explaining why one form of this code uses the memory cache more efficiently than the other. I'd like to add some background information you may not be aware of: you probably don't realize just how expensive main-memory accesses are nowadays.

The numbers posted in this question look about right to me, and I'll reproduce them here because they are so important:

Core i7 Xeon 5500 Series Data Source Latency (approximate)

 L1 CACHE hit                               ~4 cycles
 L2 CACHE hit                               ~10 cycles
 L3 CACHE hit, line unshared                ~40 cycles
 L3 CACHE hit, shared line in another core  ~65 cycles
 L3 CACHE hit, modified in another core     ~75 cycles
 remote L3 CACHE                            ~100-300 cycles
 Local DRAM                                 ~60 ns
 Remote DRAM                                ~100 ns

Note the change in units for the last two entries. Depending on which exact model you have, this processor runs at 2.9-3.2 GHz; to keep the math simple, let's just call it 3 GHz. So one cycle is about 0.33 nanoseconds, which means DRAM accesses are also in the range of 100-300 cycles.

The point is that the CPU could have executed hundreds of instructions in the time it takes to read one cache line from main memory. This is called the memory wall. Because of it, efficient use of the memory cache is more important than any other factor in overall performance on modern processors.
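A quick back-of-the-envelope check of that last statement, written as a compilable sketch (the 3 GHz clock is the assumption made above):

// One cycle at 3 GHz is 1 / 3e9 s, i.e. roughly 0.333 ns.
constexpr double ns_per_cycle       = 1.0 / 3.0;
constexpr double local_dram_cycles  = 60.0  / ns_per_cycle;   // ~180 cycles (table: ~60 ns)
constexpr double remote_dram_cycles = 100.0 / ns_per_cycle;   // ~300 cycles (table: ~100 ns)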

+34
Oct 27 '14 at 15:14

The answer depends a bit on how matrix is defined. In a fully dynamically allocated array you would have:

 T **matrix;
 matrix = new T*[n];
 for (int i = 0; i < n; i++)
 {
     matrix[i] = new T[m];
 }

So each matrix[j] requires a fresh memory lookup for the row pointer. If the j loop is the outer loop, the inner loop can reuse the pointer matrix[j] for the entire inner loop.
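Roughly speaking, with j as the outer loop the access pattern can behave like the sketch below (this continues the allocation snippet above and is only an illustration, not actual compiler output):

 T sum = T();
 for (int j = 0; j < n; j++) {
     T *row = matrix[j];        // the row pointer is loaded once per row ...
     for (int i = 0; i < m; i++)
         sum += row[i];         // ... and the inner loop walks that row contiguously
 }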

If the matrix is a simple two-dimensional array:

 T matrix[n][m]; 

then matrix[j] is simply a multiplication by 1024 * sizeof(T) - which can be done by adding 1024 * sizeof(T) to the loop index in the optimized code, so it should be relatively fast either way.

On top of that, there are cache locality factors. Caches hold "lines" of data, typically 32 to 128 bytes per line. So when your code reads address X, the cache loads the 32 to 128 bytes around X. Therefore, if the NEXT thing you need is only sizeof(T) forward from the current location, it is most likely already in the cache [and modern processors also detect that you are going round in a loop reading every memory location, and prefetch the data].

In the j inner-loop case, you read a new location sizeof(T)*1024 away on each iteration [or possibly a greater distance if it is dynamically allocated]. This means the data loaded into the cache is of no use for the next iteration, because it is not within the next 32 - 128 bytes.

And finally, it is entirely possible that the first loop is better optimized thanks to SSE instructions or similar, which allow the calculation to run even faster. But this is probably marginal for such a large matrix, since performance is heavily memory-bound at this size.

+18
Oct 27 '14 at 8:55

The memory hardware is not optimized for delivering individual addresses: instead, it tends to work with larger chunks of contiguous memory called cache lines. Every time you read one entry of your matrix, the entire cache line it lies in is loaded into the cache along with it.

The faster loop ordering is set up to read memory in order: each time you load a cache line, you use all of the entries in that cache line. So each pass through the outer loop reads every matrix entry only once.

The slower loop ordering, however, uses only one entry from each cache line before moving on. So each cache line has to be loaded multiple times, once for each matrix entry in the line. For example, if a double is 8 bytes and a cache line is 64 bytes long, then each pass through the outer loop has to read each matrix entry eight times rather than once.
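To put concrete numbers on that, here is a small sketch (assuming a double matrix[1024][1024], 8-byte doubles and 64-byte cache lines, as in the example):

#include <cstddef>

constexpr std::size_t elem_bytes     = sizeof(double);            // 8 bytes
constexpr std::size_t line_bytes     = 64;                        // typical cache line
constexpr std::size_t elems_per_line = line_bytes / elem_bytes;   // 8 entries per line

// Distance between consecutive inner-loop accesses:
constexpr std::size_t fast_stride = elem_bytes;          // matrix[i][j]: 8 bytes, the whole line gets used
constexpr std::size_t slow_stride = 1024 * elem_bytes;   // matrix[j][i]: 8192 bytes, only 1 of the 8
                                                         // entries in each loaded line is touched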




All that said, if you had turned on optimizations, you probably wouldn't notice the difference: optimizers understand this phenomenon, and good ones can recognize that they may swap which loop is the inner loop and which is the outer loop for this particular snippet of code.

(Also, a good optimizer would perform only one pass through the outermost loop, because it recognizes that the first 999 passes are irrelevant to the final value of sum.)
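As an illustration of the kind of transformation meant here, the whole benchmark is allowed (under the as-if rule) to collapse to something like the sketch below; this is not actual compiler output:

 // The first 999 iterations of kk only compute a value of sum that is
 // immediately overwritten by "sum = 0", so a single pass suffices.
 sum = 0;
 for (int i = 0; i < 1024; i++)
     for (int j = 0; j < 1024; j++)
         sum += matrix[i][j];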

+10
Oct 28 '14 at 1:15

The matrix is stored in memory as one flat vector. Accessing it the first way touches memory sequentially. Accessing it the second way requires jumping around between memory locations. See http://en.wikipedia.org/wiki/Row-major_order
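A hedged sketch of my own showing what "stored as a vector" means in practice; the 1024x1024 size is taken from the question, the chosen indices are arbitrary:

#include <cstdio>
#include <cstddef>

int main() {
    static double m[1024][1024];

    // Row-major rule: element (i, j) lives (i * 1024 + j) * sizeof(double)
    // bytes past the start of the array.
    std::ptrdiff_t offset = reinterpret_cast<const char*>(&m[2][1])
                          - reinterpret_cast<const char*>(&m[0][0]);
    std::printf("offset of m[2][1]: %td bytes (expected %td)\n",
                offset, (std::ptrdiff_t)((2 * 1024 + 1) * sizeof(double)));
    return 0;
}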

+8
Oct 27 '14 at 8:36

If you access j - i, the j dimension stays cached, so the machine code does not have to change it every time; the second dimension is not cached, so you effectively erase the cache on every access, which causes the difference.

+5
Oct 27 '14 at 8:36

By the principle of locality of reference, it is very likely that a piece of code will access neighbouring memory locations. So more values are loaded into the cache than were actually requested, which means more cache hits. Your first example satisfies this well, while your second does not.

+3
Oct 30 '14 at 4:34


