C++: How to force the cache to prefetch data? (array loop)

I have a loop like:

 start = __rdtsc();
 unsigned long long count = 0;
 for (int i = 0; i < N; i++)
     for (int j = 0; j < M; j++)
         count += tab[i][j];
 stop = __rdtsc();
 time = (stop - start) * 1/3;

I need to check how prefetching data affects performance. How can I force a prefetch of some values from memory into the cache before they are actually used?

+6
3 answers

First, I assume that tab is a large 2D array, either a static array (for example, int tab[1024*1024][1024*1024]) or a dynamically allocated one (for example, int** tab plus the corresponding mallocs). You want to prefetch some of the data from tab into the cache in order to reduce execution time.

I don't think you need to manually insert any prefetches into this code, which just performs a simple reduction over a 2D array. Modern CPUs will prefetch automatically when it is necessary and profitable.


Two facts you should know about this issue:

(1) You already exploit the spatial locality of tab inside the innermost loop. Once tab[i][0] has been read (after a cache miss or a page fault), the data from tab[i][0] to tab[i][15] will be in the CPU caches, assuming a cache line size of 64 bytes and 4-byte int elements.

(2) However, when the code crosses a row boundary, i.e., moves from tab[i][M-1] to tab[i+1][0], a cold cache miss is very likely, especially when tab is a dynamically allocated array where each row may be allocated in a fragmented, non-contiguous way. If the array is statically allocated, however, each row is contiguous in memory.


So, prefetching only makes sense (1) when you are about to read the first element of the next row, or (2) if you prefetch the element CACHE_LINE_SIZE/sizeof(tab[0][0]) positions ahead of j.
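For example, with 64-byte cache lines and 4-byte int elements, CACHE_LINE_SIZE/sizeof(tab[0][0]) = 64/4 = 16, so case (2) means prefetching tab[i][j + 16] while you are still processing tab[i][j].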

You can do this by inserting a prefetch operation (e.g., __builtin_prefetch) in the outer loop. However, modern compilers do not always emit such prefetch instructions as written, so if you really want to do this, you should check the generated binary.
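A minimal sketch of what that could look like (GCC/Clang builtin; the array dimensions are hypothetical, and it assumes that one row ahead is a useful prefetch distance):

    #include <cstdint>

    constexpr int N = 2048;   // hypothetical dimensions, as in the question
    constexpr int M = 2048;
    int tab[N][M];            // the array being summed

    unsigned long long sum_with_row_prefetch()
    {
        unsigned long long count = 0;
        for (int i = 0; i < N; i++) {
            // Ask the hardware to start loading the beginning of the next row
            // (read-only access, low temporal locality) while this row is summed.
            if (i + 1 < N)
                __builtin_prefetch(&tab[i + 1][0], 0, 0);
            for (int j = 0; j < M; j++)
                count += tab[i][j];
        }
        return count;
    }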

However, as I said, I do not recommend doing this, because modern CPUs mostly prefetch automatically, and that automatic prefetching will usually beat your manual code. For example, an Intel CPU such as Ivy Bridge has several hardware data prefetchers, e.g., prefetchers into the L1, L2, or L3 cache (I do not think mobile processors have such data prefetchers). Some of these prefetchers also load adjacent cache lines.


If you are doing more expensive computations on large 2D arrays, there are many alternative algorithms that are more cache friendly. A notable example is blocked (tiled) matrix multiplication. Naive matrix multiplication suffers from a large number of cache misses, whereas the blocked algorithm reduces them significantly by working on small sub-blocks that fit in the caches. See references such as this one.
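To illustrate the idea, here is a minimal sketch of cache blocking for matrix multiplication; the names matmul_blocked, SIZE and BLOCK are hypothetical, and the block size would need to be tuned to the actual cache sizes:

    constexpr int SIZE  = 1024;  // hypothetical matrix dimension
    constexpr int BLOCK = 64;    // tile edge; tune so the working tiles fit in cache

    // C += A * B, all SIZE x SIZE, computed tile by tile so that each tile of
    // A, B and C is reused while it is still resident in the cache.
    void matmul_blocked(const double (&A)[SIZE][SIZE],
                        const double (&B)[SIZE][SIZE],
                        double (&C)[SIZE][SIZE])
    {
        for (int ii = 0; ii < SIZE; ii += BLOCK)
            for (int kk = 0; kk < SIZE; kk += BLOCK)
                for (int jj = 0; jj < SIZE; jj += BLOCK)
                    // multiply the BLOCK x BLOCK tiles starting at (ii,kk) and (kk,jj)
                    for (int i = ii; i < ii + BLOCK; i++)
                        for (int k = kk; k < kk + BLOCK; k++) {
                            const double a = A[i][k];
                            for (int j = jj; j < jj + BLOCK; j++)
                                C[i][j] += a * B[k][j];
                        }
    }

The point of the tiling is that each BLOCK x BLOCK tile is reused many times while it is still in the cache, instead of being evicted between uses as in the naive triple loop.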

+3

For GCC only:

 __builtin_prefetch((const void*)(prefetch_address),0,0); 

prefetch_address does not have to be valid; there will be no segfault if it is not. If the distance between prefetch_address and the current read location is too small, the prefetch may have no effect or may even slow things down. Try setting it at least 1 KB ahead.
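A minimal sketch of how this might be applied to the summation loop from the question (GCC/Clang only; the dimensions and the 1 KB look-ahead distance are assumptions that would need tuning):

    constexpr int N = 2048;   // hypothetical dimensions
    constexpr int M = 2048;
    int tab[N][M];

    unsigned long long sum_with_lookahead_prefetch()
    {
        constexpr int AHEAD = 1024 / sizeof(int);  // look roughly 1 KB ahead
        unsigned long long count = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                if (j + AHEAD < M)
                    __builtin_prefetch((const void*)&tab[i][j + AHEAD], 0, 0);
                count += tab[i][j];
            }
        return count;
    }

A tuned version would issue the prefetch only once per cache line (for example, only when j is a multiple of 16 for 4-byte ints), since redundant prefetch instructions just cost decode bandwidth.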

+7

The easiest / most portable method is simply to read some data from each cache line ahead of where you are currently working. Assuming tab is a proper two-dimensional array, you could do:

 char *tptr = (char *)&tab[0][0];
 tptr += 64;
 char temp = 0;
 volatile char keep_temp_alive;

 for (int i = 0; i < N; i++)
 {
     temp += *tptr;   /* touch the next cache line so it gets loaded */
     tptr += 64;
     for (int j = 0; j < M; j++)
         count += tab[i][j];
 }
 keep_temp_alive = temp;

Something like that. However, this relies on: 1. not ending up reading outside of the allocated memory [by too much]; 2. the inner j loop not covering much more than 64 bytes of data. If it does, you can add extra steps of temp += *tptr; tptr += 64; temp += *tptr; tptr += 64; at the beginning of the loop.

The keep_temp_alive assignment after the loop is needed so that the compiler cannot remove temp and its reads entirely as unnecessary loads.

Unfortunately, I was too slow at writing generalized code to suggest the builtin instructions; points for that go to Leonid.

+3
