Interpreting the results of perf stat

I developed code that reads in a large 2D image (up to 64 MPixels) and:

  • Applies a filter along each row
  • Transposes the image (blocking is used to avoid lots of cache misses)
  • Applies a filter along the columns (the former rows) of the image
  • Transposes the filtered image back, to continue with other calculations.

Although it doesn't change anything, for the completeness of my question: the filtering is a discrete wavelet transform and the code is written in C.

My end goal is to make this run as fast as possible. The speedup I have so far is more than 10x, achieved through blocked matrix transposition, vectorization, multithreading, compiler flags, etc.
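
For illustration, a minimal sketch of the kind of blocked (tiled) transpose referred to above might look like this in C. This is not the actual code; the function name and tile size are assumptions, and TILE would need tuning to the cache sizes of the target machine:

  #include <stddef.h>

  #define TILE 64  /* hypothetical tile size; pick it so a pair of tiles fits in L1/L2 */

  /* Transpose src (rows x cols) into dst (cols x rows) one small tile at a
     time, so both the reads and the writes stay within a cache-resident
     footprint instead of striding across the whole image. */
  void transpose_blocked(const float *restrict src, float *restrict dst,
                         size_t rows, size_t cols)
  {
      for (size_t bi = 0; bi < rows; bi += TILE) {
          for (size_t bj = 0; bj < cols; bj += TILE) {
              size_t imax = (bi + TILE < rows) ? bi + TILE : rows;
              size_t jmax = (bj + TILE < cols) ? bj + TILE : cols;
              for (size_t i = bi; i < imax; i++)
                  for (size_t j = bj; j < jmax; j++)
                      dst[j * rows + i] = src[i * cols + j];
          }
      }
  }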

Coming to my question: the latest profiling stats of my code (collected with perf stat -e ) have troubled me.
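
The exact event list was not shown; an invocation along these lines (using the generic perf event aliases, with ./filter as a placeholder for the binary) would produce counters like the ones below:

  $ perf stat -e cache-references,cache-misses,cycles,instructions ./filter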

      76,321,873 cache-references
   8,647,026,694 cycles                    #    0.000 GHz
   7,050,257,995 instructions              #    0.82  insns per cycle
      49,739,417 cache-misses              #   65.171 % of all cache refs

     0.910437338 seconds time elapsed

(# of cache-misses) / (# of instructions) is low, at around 0.7%. It is mentioned here that this number is a good thing to keep in mind when checking for memory efficiency.

On the other hand, the ratio of cache-misses to cache-references is much higher (65%!), which, as I see it, could indicate that something is wrong with the execution in terms of cache efficiency.

The detailed stats from perf stat -d :

     2711.191150 task-clock                #    2.978 CPUs utilized
           1,421 context-switches          #    0.524 K/sec
              50 cpu-migrations            #    0.018 K/sec
         362,533 page-faults               #    0.134 M/sec
   8,518,897,738 cycles                    #    3.142 GHz                     [40.13%]
   6,089,067,266 stalled-cycles-frontend   #   71.48% frontend cycles idle    [39.76%]
   4,419,565,197 stalled-cycles-backend    #   51.88% backend  cycles idle    [39.37%]
   7,095,514,317 instructions              #    0.83  insns per cycle
                                           #    0.86  stalled cycles per insn [49.66%]
     858,812,708 branches                  #  316.766 M/sec                   [49.77%]
       3,282,725 branch-misses             #    0.38% of all branches         [50.19%]
   1,899,797,603 L1-dcache-loads           #  700.724 M/sec                   [50.66%]
     153,927,756 L1-dcache-load-misses     #    8.10% of all L1-dcache hits   [50.94%]
      45,287,408 LLC-loads                 #   16.704 M/sec                   [40.70%]
      26,011,069 LLC-load-misses           #   57.44% of all LL-cache hits    [40.45%]

     0.910380914 seconds time elapsed

Frontend and backend stalled cycles are high here too, and the lower-level caches seem to suffer from a high miss rate of 57.5%.

Which metric is the most appropriate for this scenario? One idea I had is that maybe the workload no longer requires "touching" the LL caches after the original image is loaded (it loads the values once and after that it is done; the workload is more CPU-bound than memory-bound, being an image filtering algorithm).

The machine I am running this on is a Xeon E5-2680 (20 MB of SmartCache, out of which the L2 cache is 256 KB per core; 8 cores).

1 answer

The first thing you want to make sure of is that no other compute-intensive process is running on your machine. It's a server-grade CPU, so I thought this might be an issue.

If you use multithreading in your program and distribute an equal amount of work between the threads, you might be interested in collecting the metrics on only one CPU.
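
For example, perf can restrict counting to one CPU with -C while taskset pins the process to it; this is a generic sketch, with ./filter again standing in for the actual binary:

  # pin the program to CPU 0 and count events on that CPU only
  $ taskset -c 0 perf stat -C 0 -d ./filter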

I suggest disabling hyper-threading during the optimization phase, as it can lead to confusion when interpreting the profiling metrics (e.g. an increased number of stalled cycles in the backend). Also, if you spread the work across 3 threads, there is a high chance that 2 of them will share the resources of a single core while the third gets an entire core to itself, and that one will be faster.
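
On a reasonably recent Linux kernel, SMT can be toggled at runtime through sysfs; this is one possible way, assuming the interface is available on your system (on older kernels or firmware-managed setups, the BIOS is the usual place to do this):

  # check the current SMT state, then disable it until the next reboot
  $ cat /sys/devices/system/cpu/smt/control
  on
  $ echo off | sudo tee /sys/devices/system/cpu/smt/control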

Perf has never been very good at explaining its metrics. Judging by the order of magnitude, the cache references are L2 misses that hit the LLC. A high LLC-miss number compared to LLC-references is not always a bad thing if the LLC-references / # instructions ratio is low. In your case you have 0.018, which means that most of your data gets served from L2. A high LLC miss rate means that you still need to fetch data from RAM and write the results back.

As for the FE and BE stalled cycles, I'm a bit worried about the values because they don't add up to 100% of the total cycle count. You have 8G cycles in total, yet 6G cycles stalled in the FE and 4G stalled in the BE. That doesn't seem right.

If the FE stall count is high, it means you have misses in the instruction cache or bad branch prediction. If the BE stall count is high, it means you are waiting for data.

Anyway, regarding your question: the most relevant metric for judging the performance of your code is Instructions / Cycle (IPC). Your CPU can execute up to 4 instructions per cycle, but you are only executing 0.8. This means the resources are underutilized, unless you have many vector instructions. After IPC you need to check for branch misses and L1 misses (data and code), since these generate most of the penalties.
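
A focused counter set for exactly those events could look like this (these are the generic perf event aliases; availability varies by CPU and kernel, and ./filter is a placeholder):

  $ perf stat -e cycles,instructions,branch-misses,L1-dcache-load-misses,L1-icache-load-misses ./filter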

One last suggestion: you might be interested in trying Intel VTune Amplifier. It gives a much better explanation of the metrics and points out potential problems in your code.

