First, I assume that `tab` is a large 2D array, either a static array (for example, `int tab[1024*1024][1024*1024]`) or a dynamically allocated array (for example, `int** tab` followed by a `malloc` for each row). Here you want to prefetch some data from `tab` into the cache to reduce execution time.
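For concreteness, here is a minimal sketch of the kind of loop I assume we are talking about; the names `N`, `M`, and `sum` are illustrative assumptions, not from your code:

```c
/* Minimal sketch of a simple reduction over a 2D array.
   N, M, and the element type are illustrative assumptions. */
long long sum = 0;
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        sum += tab[i][j];
```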
I just don't think you need to manually insert any prefetches into your code for such a simple reduction over a 2D array. Modern processors will prefetch automatically whenever it is necessary and profitable.
Two facts you should know about this issue:
(1) You are already exploiting the spatial locality of `tab` inside the innermost loop. Once `tab[i][0]` has been read (after a cache miss or a page fault), the data from `tab[i][0]` through `tab[i][15]` will be in the processor caches, assuming the cache line size is 64 bytes (and `int` is 4 bytes).
(2) However, when the code crosses a row boundary, i.e. moves from `tab[i][M-1]` to `tab[i+1][0]`, a cold cache miss is very likely, especially when `tab` is a dynamically allocated array in which each row may be allocated in a fragmented way. If the array is statically allocated, each row is contiguous with the next in memory (see the sketch below).
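To illustrate the difference, a hedged sketch of the two allocation styles (the sizes and names are illustrative):

```c
#include <stdlib.h>

#define N 1024
#define M 1024

/* Static: one contiguous block, so tab_static[i][M-1] and
   tab_static[i+1][0] are adjacent in memory. */
static int tab_static[N][M];

/* Dynamic: each row is a separate allocation, so consecutive rows
   may be scattered across the heap. */
int **alloc_dynamic(void)
{
    int **tab = malloc(N * sizeof *tab);
    for (int i = 0; i < N; i++)
        tab[i] = malloc(M * sizeof *tab[i]);
    return tab;
}
```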
So, prefetching matters only when you read (1) the first element of the next row, or (2) the element `CACHE_LINE_SIZE / sizeof(tab[0][0])` iterations ahead of the current `j`.
You could do this by inserting a prefetch operation (e.g., `__builtin_prefetch`) in the upper loop. However, compilers may not always emit the corresponding prefetch instructions; if you really want to do this, you should check the generated binary.
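As a hedged sketch (not a recommendation), the insertion could look like this; `LINE_INTS` and the loop bounds are illustrative assumptions:

```c
/* Sketch: issue one prefetch per cache line within the row, plus a
   prefetch for the first line of the next row. __builtin_prefetch is
   a GCC/Clang builtin: (address, rw = 0 for read, locality 0..3). */
enum { LINE_INTS = 64 / sizeof(int) };  /* ints per 64-byte line: 16 */

long long sum = 0;
for (int i = 0; i < N; i++) {
    if (i + 1 < N)
        __builtin_prefetch(&tab[i + 1][0], 0, 3);  /* next row's first line */
    for (int j = 0; j < M; j++) {
        if (j % LINE_INTS == 0 && j + LINE_INTS < M)
            __builtin_prefetch(&tab[i][j + LINE_INTS], 0, 3);
        sum += tab[i][j];
    }
}
```

On x86 you can check the binary with something like `objdump -d a.out | grep prefetch` to see whether `prefetcht0`/`prefetchnta` instructions were actually emitted.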
However, as I said, I do not recommend doing this, because modern processors mostly prefetch automatically, and that hardware prefetching will mostly outperform your manual code. For example, Intel processors such as Ivy Bridge have several hardware data prefetchers, e.g., prefetchers into the L1, L2, or L3 cache. (I do not think that mobile processors have such rich data prefetchers.) Some prefetchers also load adjacent cache lines.
If you are doing more expensive computations on large 2D arrays, there are many alternative algorithms that are more cache-friendly. A notable example is blocked (tiled) matrix multiplication. Naive matrix multiplication suffers from a large number of cache misses, but a blocked algorithm significantly reduces them by computing on small sub-blocks that fit in the caches. See some examples, such as this.
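As a hedged illustration of the blocking idea (the function name, tile size `BS`, and `double` element type are assumptions, not from the original):

```c
/* Sketch of blocked (tiled) matrix multiplication: C += A * B for
   n x n matrices in row-major order. C must be zero-initialized.
   BS is an illustrative tile size. */
#define BS 64

void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* multiply the (ii,kk) tile of A by the (kk,jj) tile of B */
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The tile size is typically tuned so that the working set of a few `BS x BS` tiles fits in the target cache level.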