How can Skylake's reduced L2 associativity be an improvement?

The Intel Optimization Guide, section 2.1.3, lists a number of enhancements to the caches and memory subsystem in Skylake (my selection):

The Skylake microarchitecture cache hierarchy has the following improvements:

  • Higher cache bandwidth compared to previous generations.
  • Simultaneous handling of more loads and stores, enabled by enlarged buffers.
  • The processor can do two page walks in parallel, compared to one in the Haswell microarchitecture and earlier generations.
  • Page-split load penalty reduced from 100 cycles in the previous generation to 5 cycles.
  • L3 write bandwidth increased from 4 cycles per line in the previous generation to 2 cycles per line.
  • Support for the CLFLUSHOPT instruction to flush cache lines and control memory ordering of flushed data using SFENCE.
  • Reduced performance penalty for a software prefetch that specifies a NULL pointer.
  • L2 associativity changed from 8 ways to 4 ways.
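To make the last bullet concrete, here is a small sketch (my own illustration, not from the Intel manual) of what "8 ways to 4 ways" means geometrically for a 256 KiB L2 with 64-byte lines: halving the ways doubles the number of sets, and changes how many conflicting lines one set can hold.

```python
# Sketch: geometry of a 256 KiB set-associative L2 cache with 64 B lines.
# The sizes are the documented Skylake client L2 parameters; the rest is
# just arithmetic on them.
LINE = 64            # bytes per cache line
SIZE = 256 * 1024    # Skylake client L2 size in bytes

def sets(ways):
    """Number of sets in a set-associative cache of this size."""
    return SIZE // (LINE * ways)

for ways in (8, 4):
    n = sets(ways)
    # Addresses that are (n * LINE) bytes apart map to the same set,
    # so at most `ways` such lines can be resident at once.
    stride = n * LINE
    print(f"{ways}-way: {n} sets, same-set stride = {stride // 1024} KiB")
# → 8-way: 512 sets, same-set stride = 32 KiB
# → 4-way: 1024 sets, same-set stride = 64 KiB
```

So the total capacity is unchanged, but any set of addresses that collide in the index bits can now only keep 4 lines resident instead of 8.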

That last one caught my eye. How can fewer ways be an improvement? On its own, fewer ways seems strictly worse than more. Of course, I understand there may be sound engineering reasons why reducing the number of ways is a trade-off that enables other improvements, but here it is itself presented as an improvement.

What am I missing?

x86 cpu intel cpu-cache
1 answer

Yes, this is worse for L2 cache performance.

According to this AnandTech SKL-SP (aka Skylake-AVX512 or SKL-X) writeup, Intel stated that "the main reason [for reducing associativity] was to make the design more modular." Skylake-AVX512 has a 1 MiB L2 cache with 16-way associativity.

Presumably the drop to 4-way associativity doesn't hurt too much in dual-core and quad-core laptop and desktop chips (SKL-S), since there is plenty of L3 cache bandwidth. I think that if Intel's simulations and testing had found it very painful, they would have spent the extra design time to keep the 8-way 256k L2 on non-AVX512 Skylake.
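Where it does hurt is a working set of more than 4 hot lines that all index the same set. A toy LRU simulation of a single cache set (my own sketch, not a model of the real replacement policy, which isn't strict LRU) shows the cliff:

```python
# Toy single-set cache with LRU replacement: 5 hot lines that collide
# in the index bits fit in an 8-way set but thrash a 4-way set.
from collections import OrderedDict

def misses(ways, accesses):
    """Simulate one cache set under LRU replacement; return miss count."""
    lru = OrderedDict()   # keys = resident line tags, order = recency
    miss = 0
    for tag in accesses:
        if tag in lru:
            lru.move_to_end(tag)          # hit: refresh recency
        else:
            miss += 1
            if len(lru) == ways:
                lru.popitem(last=False)   # evict the least-recently-used line
            lru[tag] = True
    return miss

# 5 conflicting lines, touched round-robin 100 times each.
pattern = list(range(5)) * 100
print("8-way misses:", misses(8, pattern))   # → 5  (cold misses only)
print("4-way misses:", misses(4, pattern))   # → 500 (every access misses)
```

Under LRU, a round-robin pattern one line wider than the associativity degenerates to a 100% miss rate, which is the worst case the change exposes.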


The payoff from lower associativity is power budget. It can indirectly help performance by allowing more turbo, but they mostly did it for efficiency, not speed. Freeing up some room in the power budget lets them spend it elsewhere, or not spend it at all and simply use less energy.

Mobile and many-core server CPUs care much more about power budget than high-performance quad-core desktop CPUs do.

The heading on that list would more accurately say "changes" rather than "improvements", but I'm sure marketing wouldn't let them write anything that didn't sound positive. :P At least Intel documents things accurately and in detail, including the ways new CPUs are worse than previous models.


The AnandTech SKL writeup suggests that reducing associativity freed up power budget for higher L2 bandwidth, which (in the big picture) makes up for the increased miss rate.

IIRC, Intel has a policy that any proposed design change must have a performance-gain to power-cost ratio of 2:1, or something like that. So presumably, if they lost 1% of performance but saved 3% of power with this L2 change, they'd take it. The 2:1 number may be right if I'm remembering it correctly, but the 1% and 3% example is totally made up.

There was some discussion of this change in one of the podcast interviews David Kanter did right after the details were released at IDF. IDK if this is the right link.
