Yes, you will have more cache hits on one side than on the other.
The trick, however, is to break the matrix into blocks small enough that the cache lines already loaded get reused while the block is processed.

E.g., in the above example we get 1 cache miss on the src matrix and 4 on the dst side (I chose a cache line of 4 elements and a block size of 4 elements, but that is just a coincidence).
If the cache holds 5 lines or more, we get no further misses while processing that line.
If the cache is smaller than that, there will be more misses, because the lines evict each other. In this case src stays in the cache because it is used more often, and dst gets evicted, which gives us 16 misses on the dst side. 5 misses in total (1 + 4) looks better than 17 (1 + 16) :)
So by keeping the block size small enough, we can reduce the number of cache misses.
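
As a rough illustration of the idea (a minimal sketch, not the code from the question), a blocked transpose could look something like this. The names `transpose_blocked` and `BLOCK`, the `double` element type, and the square row-major layout are all my assumptions; the tile edge of 4 just mirrors the numbers used above and would need tuning on real hardware.

```c
#include <stddef.h>

/* Hypothetical tile edge: 4 matches the 4-element cache line assumed in the
   example above. A real value should be found by measurement. */
#define BLOCK 4

/* Transpose the n x n row-major matrix src into dst, one tile at a time.
   Within a tile, the same few dst cache lines are written repeatedly,
   so they can stay in the cache until the tile is finished. */
void transpose_blocked(const double *src, double *dst, size_t n)
{
    for (size_t ib = 0; ib < n; ib += BLOCK) {
        for (size_t jb = 0; jb < n; jb += BLOCK) {
            /* Clamp the tile at the matrix edge when n is not a multiple of BLOCK. */
            size_t i_end = ib + BLOCK < n ? ib + BLOCK : n;
            size_t j_end = jb + BLOCK < n ? jb + BLOCK : n;

            for (size_t i = ib; i < i_end; i++)
                for (size_t j = jb; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

The two outer loops pick a tile and the two inner loops transpose only that tile, so the working set at any moment is roughly one src line plus `BLOCK` dst lines, which is the 1 + 4 case described above.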
Ivan