I have written code that takes in a large two-dimensional image (up to 64 MPixels) and:
- Applies filters to each row
- Transposes the image (blocking is used to avoid lots of cache misses)
- Applies filters to the columns (now rows) of the image
- Transposes the filtered image back to carry on with other calculations
Although it should not change anything, for completeness: the filtering uses a discrete wavelet transform, and the code is written in C.
My ultimate goal is to make it run as fast as possible. The speedups I have so far amount to more than 10x, through the use of blocking, transposition, vectorization, multithreading, compiler optimization flags, etc.
Coming to my question: recently, profiling the code (using perf stat -e ) gave me results that bothered me:
    76,321,873        cache-references
    8,647,026,694     cycles            #    0.000 GHz
    7,050,257,995     instructions      #    0.82  insns per cycle
    49,739,417        cache-misses      #   65.171 % of all cache refs

    0.910437338 seconds time elapsed
(# of cache-misses) / (# of instructions) is around 0.7%. It is mentioned here that this number is a good thing to keep in mind when checking for memory efficiency.
On the other hand, the ratio of cache-misses to cache-references is much higher (65%!), which, as I see it, may indicate that something is wrong with the execution in terms of cache efficiency.
Detailed stats from perf stat -d :
    2711.191150 task-clock
Front-end and back-end stalled cycles are both prominent here, and the lower-level caches seem to suffer from a high miss rate of 57.5%.
Which metric is the most meaningful for this scenario? One idea I have is that maybe the workload no longer requires much "touching" of the LL caches after the initial image load (it loads the values once and is then done with them): the workload is more CPU-bound than memory-bound, being an image filtering algorithm.
The machine I run this on is a Xeon E5-2680 (20 MB SmartCache, of which the L2 cache is 256 KB per core; 8 cores).