How does splitting a matrix into blocks improve caching?

I am trying to understand the concepts of caches and blocking while transposing a matrix.

I understand the concept of row-major memory layout, so I get that when I access data along a row, I get fewer cache misses than when I access it along a column.

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            destination[j + i*n] = source[i + j*n];

So for the source matrix I will have fewer cache misses, and for the destination matrix I will have more.


Here is the blocked code:

    for (int i = 0; i < n; i += blocksize) {
        for (int j = 0; j < n; j += blocksize) {
            // transpose the block beginning at [i,j]
            for (int k = i; k < i + blocksize; ++k) {
                for (int l = j; l < j + blocksize; ++l) {
                    dst[k + l*n] = src[l + k*n];
                }
            }
        }
    }

The code above uses a blocking technique. I can't figure out how blocking helps with performance.

Tags: caching, matrix
2 answers

Yes, you will have more cache hits on one side than on the other.

The trick, however, is to break it into small enough parts so that they can be "reused" in processing.

[figure: matrices]

E.g. in the example above, when processing one row of a block we will have 1 cache miss on the src matrix and 4 on the dst side (I chose a cache line size of 4 elements and a block size of 4 elements, but that is just a coincidence).

If the cache holds at least 5 lines, we will get no further misses while processing the remaining rows of the block: the 4 dst lines are reused, and each new row needs only 1 new src line.

If the cache is smaller than that, there will be more misses, because the lines will evict each other. In that case src, being the more recently used each time, will stay in the cache while the dst lines get discarded, giving us 16 misses on the dst side. 5 looks better than 17 :)

Thus, by keeping the block size small enough, we can reduce the cache miss rate.


The most important aspect here is locality, and in particular for your example, spatial locality. Simply put, this means accessing data that are close to each other in memory.

Whenever one matrix element is accessed and it is not yet in the cache, an entire cache line is loaded. Such a cache line may contain, for example, 64 bytes, i.e. 16 integer elements. If you do not use the other elements while the line is in the cache, your code is inefficient.

A cache line will be evicted if you load too much other data between accesses to it. Consider the first access of the inner loop, which loads source[i]. The next element in memory, source[i + 1], is on the same cache line and gets loaded along with it. However, it is only needed after the entire inner loop has completed. If the inner loop pulls in too much data, source[i + 1] will no longer be in the cache by then. The distance between the accesses to source[i] and source[i + 1] is too great.

Blocking eliminates this inefficiency. The innermost loops are much smaller than before, so the distance between accesses shrinks and all the data they process stays in the cache.

