CUDA: Is the Fermi texture cache separate from the L1 cache?

Does it make sense to rewrite my code so that it loads data through the texture cache (given that I do not need filtering or the other features of the texture unit), or is it effectively the same thing? What about loading some of the data through the L1 cache and some through texture? I have code where I could use such a strategy, but does it make sense at all?

To be clear, what I am asking is whether the Fermi texture cache is separate hardware from the L1 cache hardware. In other words, can I effectively get the combined capacity of the L1 cache plus the texture cache for my code?

1 answer

It is separate. Texture loads do not pass through L1. For applications that are not doing actual texturing (i.e., you are not using features such as interpolation and clamping), the main benefit of texturing is that it lets you selectively route some of your global memory loads through a path that can potentially be cached (subject to locality and reuse) without disturbing what is going on in L1.

For small data sets, texture will not do any better than L1. For larger data sets, where there is some locality and reuse but the loads covered by texture might otherwise overflow or thrash L1 (which may be as small as 16 KB per SM on Fermi, depending on the cache configuration), the texture cache may provide a benefit to the application as a whole. Users frequently find that texture usage is not as fast as it would be if everything could be cached in L1, but considerably faster than uncached loads or scattered loads that thrash L1. A lot will depend on your access patterns and data sizes.

The texture cache is approximately 8 KB per SM. You can have a much larger region cached through it, but a high degree of reuse and locality will improve the benefit you get from the texture cache. Also note that access through texture is read-only. You might be interested in this webinar.
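
To illustrate, here is a minimal sketch (not part of the original answer) of the kind of split strategy the question asks about, using the legacy Fermi-era texture reference API. The names (texA, addKernel) are made up for illustration, and this API has since been deprecated and removed from recent CUDA toolkits. Reads of one array go through the texture cache via tex1Dfetch, while reads of the other array go through the normal L1/L2 path:

    #include <cuda_runtime.h>

    // Legacy texture reference (Fermi-era API). Loads through it use the
    // texture cache path instead of L1. Must be declared at file scope.
    texture<float, 1, cudaReadModeElementType> texA;

    // Hypothetical kernel: 'a' is read through the texture cache via
    // tex1Dfetch, 'b' is read through the normal L1 path, so the two
    // streams of loads use separate on-chip caches.
    __global__ void addKernel(const float *b, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(texA, i) + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *d_a, *d_b, *d_out;
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMalloc(&d_out, bytes);
        // ... fill d_a and d_b with data ...

        // Bind the linear allocation to the texture reference so that
        // tex1Dfetch reads it through the texture cache.
        cudaBindTexture(NULL, texA, d_a, bytes);

        addKernel<<<(n + 255) / 256, 256>>>(d_b, d_out, n);
        cudaDeviceSynchronize();

        cudaUnbindTexture(texA);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
        return 0;
    }

On Kepler and later GPUs you would typically use __ldg() or const __restrict__ pointers to route loads through the read-only (texture) cache instead, but on Fermi the texture reference path shown above is the available mechanism.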

