How to create the processor cache effect in C and Java?

In Ulrich Drepper's article What Every Programmer Should Know About Memory, Part 3: CPU Caches, there is a graph showing the relationship between the size of the "working set" and the CPU cycles consumed per operation (in this case, sequential reading). Two jumps in the graph mark the sizes of the L1 and L2 caches. I wrote my own program in C to reproduce the effect. It simply reads an int array sequentially from head to tail, and I tried different array sizes (from 1 KB to 1 MB). I plot the data on a graph, but there is no jump; the graph is a straight line.

My questions:

  1. Is there something wrong with my method? How can I produce the processor cache effect (and see the jumps)?
  2. I thought that for a sequential read it should work like this: the first element read is not in the cache, so there is a miss, followed by hits for the rest of the cache line (64 bytes). Prefetching then hides the latency of loading the next cache line. Data keeps streaming into the L1 cache; even when the working set exceeds the L1 size, the least recently used lines are evicted and prefetching continues. So most cache misses are hidden, and the time to fetch data from L2 overlaps with the reading, i.e. the two happen simultaneously, and the associativity (8-way in my case) also helps hide the L2 read latency. If so, what my program shows would be correct. Am I missing something?
  3. Is it possible to get the same effect in Java?

By the way, I am doing this on Linux.


UPDATE 1

Thanks for Stephen C's suggestion. Here is some additional info. This is my code:

    int *arrayInt;

    void initInt(long len)
    {
        int i;
        arrayInt = (int *)malloc(len * sizeof(int));
        memset(arrayInt, 0, len * sizeof(int));
    }

    long sreadInt(long len)
    {
        int sum = 0;
        struct timespec tsStart, tsEnd;

        initInt(len);

        clock_gettime(CLOCK_REALTIME, &tsStart);
        for (i = 0; i < len; i++) {
            sum += arrayInt[i];
        }
        clock_gettime(CLOCK_REALTIME, &tsEnd);

        free(arrayInt);
        return (tsEnd.tv_nsec - tsStart.tv_nsec) / len;
    }

In the main() function I tried array sizes from 1 KB to 100 MB, and the result is always the same: the average time spent per element is 2 nanoseconds. I think this is the L1d access time.
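A side note on the timing: subtracting only the tv_nsec fields goes wrong whenever the clock crosses a second boundary (the difference can even be negative). Below is a minimal sketch of a safer variant, assuming CLOCK_MONOTONIC is available; the names diffNs, sreadInt2 and sink are illustrative, not from the original code.

    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    static volatile long sink;   /* a visible side effect, so the loop cannot be optimized away */

    /* Full difference in nanoseconds, using both tv_sec and tv_nsec. */
    static long long diffNs(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
    }

    long sreadInt2(long len)
    {
        long i, sum = 0;
        struct timespec tsStart, tsEnd;
        int *arr = (int *)malloc(len * sizeof(int));
        memset(arr, 0, len * sizeof(int));

        clock_gettime(CLOCK_MONOTONIC, &tsStart);   /* monotonic: unaffected by clock adjustments */
        for (i = 0; i < len; i++) {
            sum += arr[i];
        }
        clock_gettime(CLOCK_MONOTONIC, &tsEnd);

        sink = sum;
        free(arr);
        return (long)(diffNs(tsStart, tsEnd) / len);
    }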

My cache sizes:

L1d == 32 KB

L2 == 256 KB

L3 == 6144 KB


UPDATE 2

I changed my code to use a linked list.

    // element type
    struct l {
        struct l *n;
        long int pad[NPAD];   // NPAD can be changed; in my case I set it to 1
    };

    struct l *array;
    long globalSum;

    // initialize the array
    void init(long len)
    {
        long i, j;
        struct l *ptr;

        array = (struct l *)malloc(sizeof(struct l));
        ptr = array;
        for (j = 0; j < NPAD; j++) {
            ptr->pad[j] = j;
        }
        ptr->n = NULL;

        for (i = 1; i < len; i++) {
            ptr->n = (struct l *)malloc(sizeof(struct l));
            ptr = ptr->n;
            for (j = 0; j < NPAD; j++) {
                ptr->pad[j] = i + j;
            }
            ptr->n = NULL;
        }
    }

    // free the array when the operation is done
    void release()
    {
        struct l *ptr = array;
        struct l *tmp = NULL;
        while (ptr) {
            tmp = ptr;
            ptr = ptr->n;
            free(tmp);
        }
    }

    double sread(long len)
    {
        int i;
        long sum = 0;
        struct l *ptr;
        struct timespec tsStart, tsEnd;

        init(len);

        ptr = array;
        clock_gettime(CLOCK_REALTIME, &tsStart);
        while (ptr) {
            for (i = 0; i < NPAD; i++) {
                sum += ptr->pad[i];
            }
            ptr = ptr->n;
        }
        clock_gettime(CLOCK_REALTIME, &tsEnd);

        release();
        globalSum += sum;
        return (double)(tsEnd.tv_nsec - tsStart.tv_nsec) / (double)len;
    }

Finally, I print globalSum to avoid compiler optimization. As you can see, this is still a sequential read. I even tried array sizes up to 500 MB, and the average time per element is about 4 nanoseconds (perhaps because it has to access both the data pad and the pointer n, two accesses), the same as for a 1 KB array. So I think it is because cache optimizations such as prefetching really do hide the latency; am I right? I will try random access and publish the results later.
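One way to build such a randomly ordered list is sketched below: all nodes are allocated in one block, an index permutation is shuffled (Fisher-Yates), and the nodes are chained in that order, so each pointer chase lands on an unpredictable cache line. The helper name initRandom is illustrative; note the nodes come from a single malloc, so they must be freed with one free() of the whole block rather than with release() above.

    // Sketch: link the nodes of struct l (from above) in a random order.
    void initRandom(long len)
    {
        long i, j;
        long *perm = (long *)malloc(len * sizeof(long));
        struct l *nodes = (struct l *)malloc(len * sizeof(struct l));

        for (i = 0; i < len; i++) {
            perm[i] = i;
        }
        for (i = len - 1; i > 0; i--) {        // Fisher-Yates shuffle
            long k = rand() % (i + 1);
            long t = perm[i]; perm[i] = perm[k]; perm[k] = t;
        }

        for (i = 0; i < len; i++) {            // fill the pads, as in init()
            for (j = 0; j < NPAD; j++) {
                nodes[i].pad[j] = i + j;
            }
        }
        for (i = 0; i < len - 1; i++) {        // chain the nodes in shuffled order
            nodes[perm[i]].n = &nodes[perm[i + 1]];
        }
        nodes[perm[len - 1]].n = NULL;

        array = &nodes[perm[0]];
        free(perm);
    }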


UPDATE 3

I tried random access on the linked list; here is the result: [graph: randomly accessing a linked list]

The first red line marks my L1 cache size, the second my L2. We can see small jumps there, though part of the latency is still well hidden.

+8
5 answers

This is not really an answer, but more a collection of notes.

First, CPUs tend to work with cache lines, not with individual bytes or words. This means that if you sequentially read/write an array of integers, the first access to a cache line may cause a cache miss, but subsequent accesses to other integers in the same cache line will not. For 64-byte cache lines and 4-byte integers, this means you get a cache miss only once for every 16 accesses; which dilutes the results.
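For instance, to see the miss cost undiluted you can touch only one int per cache line. A minimal sketch, assuming 64-byte lines (the names CACHE_LINE, STRIDE and strideRead are illustrative):

    #define CACHE_LINE 64                        /* assumed line size in bytes */
    #define STRIDE (CACHE_LINE / sizeof(int))    /* 16 ints per line for 4-byte ints */

    /* Touch one int per cache line, so every access is a potential miss. */
    long strideRead(const int *arr, long len)
    {
        long i, sum = 0;
        for (i = 0; i < len; i += STRIDE) {
            sum += arr[i];
        }
        return sum;
    }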

Second, the CPU has a "hardware prefetcher". If it detects that cache lines are being read sequentially, the hardware prefetcher will automatically prefetch the cache lines it predicts will be needed next (in an attempt to fetch them into the cache before they are needed).

Third, the CPU does other things (like "out-of-order execution") to hide fetch costs. The time difference you can measure (between a cache hit and a cache miss) is the time the CPU could not hide, not the total cost of the fetch.

These three things combined mean that, when sequentially reading an array of integers, it is likely that the CPU prefetches the next cache line while you do 16 reads from the previous one, and any cache miss costs will be unnoticeable and may be entirely hidden. To prevent this, you would need to "randomly" access each cache line once, to maximize the measured performance difference between "working set fits in the cache(s)" and "working set does not fit in the cache(s)".
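A minimal sketch of that access pattern, reusing STRIDE from the sketch above; the order array is assumed to hold a pre-shuffled permutation (e.g. from a Fisher-Yates shuffle, as in the list example earlier):

    /* Visit every cache line exactly once, but in a random order.
     * 'order' is a permutation of 0..nLines-1. */
    long randomLineRead(const int *arr, const long *order, long nLines)
    {
        long i, sum = 0;
        for (i = 0; i < nLines; i++) {
            sum += arr[order[i] * STRIDE];
        }
        return sum;
    }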

Finally, there are other factors that can influence measurements. For example, on an OS that uses paging (e.g. Linux and almost all other modern OSs) there is a whole layer of caching on top of all this (TLBs, Translation Lookaside Buffers), and TLB misses occur once the working set exceeds a certain size; this should be visible as a fourth "step" in the graph. There is also interference from the kernel (IRQs, page faults, task switches, other CPUs, etc.), which can show up as random noise in the graph (unless the tests are repeated often and outliers discarded). There are also artifacts of the cache design (cache associativity) that can reduce cache effectiveness in ways that depend on the physical addresses allocated by the kernel; this can show up as the "steps" in the graph shifting to different places.

+5

Is there something wrong with my method?

Possibly, but without seeing your actual code that cannot be answered.

  • Your description of what your code does does not say whether you read the array once or many times.

  • The array may not be big enough... depending on your hardware. (Don't some modern chips have an L3 cache of several megabytes?)

  • In the Java case in particular, you have to do many things the right way to implement a meaningful micro-benchmark.


In the case of C:

  • You can try tuning the C compiler optimization switches.

  • Since your code accesses the array sequentially, the compiler may order the instructions so that the processor can keep up, or the processor may optimistically prefetch or do wide fetches. You could try reading the array elements in a less predictable order.

  • It is even possible that the compiler has optimized the loop away entirely, because the result of the loop computation is not used for anything.

(According to this Q&A - How long does it take to fetch one word from memory? - a fetch from the L2 cache is ~7 nanoseconds and a fetch from main memory is ~100 nanoseconds, yet you are getting ~2 nanoseconds per element. Something clever must be going on here to make it run as fast as you observe.)

+3

With gcc-4.7, compiling with gcc -std=c99 -O2 -S -D_GNU_SOURCE -fverbose-asm tcache.c, you can see that the compiler optimizes enough to remove the for loop (because sum is not used).

I had to improve your source code: some #include-s are missing, and i is not declared in the second function, so your example does not even compile as it is.

Make sum a global variable, or pass it somehow to the caller (perhaps with a global int globalsum; and setting globalsum = sum; after the loop).
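A minimal sketch of that fix, applied to the loop from the question (the store into a global is a side effect the compiler must preserve; the timing calls are elided here):

    int globalsum;     /* the store below is observable, so the loop survives -O2 */

    long sreadInt(long len)
    {
        long i;
        int sum = 0;
        initInt(len);
        for (i = 0; i < len; i++) {
            sum += arrayInt[i];
        }
        globalsum = sum;   /* without this, the whole loop may be removed as dead code */
        free(arrayInt);
        return globalsum;
    }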

And I am not sure you are right to clear the array with memset. I could imagine a compiler smart enough to understand that you are summing all zeros.

Finally, your code has extremely regular behavior with good locality: once in a while there is a cache miss, the whole cache line is loaded, and the data stays useful for many iterations. Some clever optimizations (e.g. -O3 or better) might also generate good prefetch instructions. This is near-optimal for caches: with an L1 cache line of 32 words, a cache miss occurs only once every 32 iterations, so its cost is well amortized.

Making your data a linked list will make the cache behavior worse. Conversely, in some real-world programs, carefully adding a __builtin_prefetch in a few well-chosen places can improve performance by more than 10% (but adding too many of them will decrease performance).
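For illustration, a minimal sketch of such a hint in the linked-list walk from UPDATE 2 (GCC-specific builtin; whether it actually helps depends on how the allocator laid out the nodes, and the function name sreadPrefetch is illustrative):

    /* Same struct l, array, NPAD, init(), release(), globalSum as in UPDATE 2. */
    void sreadPrefetch(long len)
    {
        long sum = 0;
        int i;
        struct l *ptr;

        init(len);
        for (ptr = array; ptr != NULL; ptr = ptr->n) {
            /* arguments: address, 0 = prefetch for read, 1 = low temporal locality */
            __builtin_prefetch(ptr->n, 0, 1);
            for (i = 0; i < NPAD; i++) {
                sum += ptr->pad[i];
            }
        }
        release();
        globalSum += sum;
    }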

In real life, the processor spends most of its time waiting for some cache (and it is hard to measure; this waiting is CPU time, not idle time). Remember that on an L3 cache miss, the time needed to load data from RAM is the time needed to execute hundreds of machine instructions!

+1

I cannot say for sure about 1 and 2, but it would be more challenging to run such a test successfully in Java. In particular, I would be concerned that managed-language features such as automatic garbage collection may kick in in the middle of your test and throw off your results.

0

As you can see from Figure 3.26, the Intel Core 2 shows practically no jumps while reading (the red line at the top of the graph); the jumps are clearly visible only when writing/copying. So it is better to do a write test.

0
