There are many questions here, so I will keep the answers short.
How does the processor decide how data is loaded into the L2 cache?
Everything you touch gets loaded. L2 behaves the same way as L1, except that there is more of it, and aliasing (which can lead to premature eviction) is more common due to its larger lines and smaller population. Some CPUs load data into L2 only when it is evicted from L1, but this matters little to the programmer.
Most MMUs have a facility for uncached memory, but that is for device drivers. I don't recall ever seeing an option to disable L2 without also disabling L1. Without caching, you don't get performance.
How can a programmer check what the cache line size is for a given architecture?
Consult the processor's manual. Some operating systems also provide a query interface, such as `sysctl`.
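For example, on Linux with glibc you can ask at runtime. A minimal sketch; note that `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension and may report 0 on some systems, and on macOS the equivalent query is `sysctlbyname("hw.cachelinesize", ...)`:

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* glibc extension: reports the L1 data cache line size,
     * or 0 if the value is not available on this system. */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0)
        printf("L1 data cache line size: %ld bytes\n", line);
    else
        printf("Line size not reported; consult the CPU manual.\n");
    return 0;
}
```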
How can a programmer ensure that data is organized so that it fits into cache lines?
The basic idea is spatial locality: data that is accessed together, by the same inner loop, should be kept in the same data structure. The ideal arrangement is to fit that structure within a cache line and align it to the cache-line size.
Don't worry about this too much unless you're using a profiler as a guide.
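To illustrate, a minimal C11 sketch, assuming a 64-byte line and a hypothetical `particle` structure whose fields are all touched by the same inner loop:

```c
#include <stdalign.h>

#define LINE 64 /* assumed cache-line size; verify for your target */

/* All the fields the hot loop touches live in one struct. Aligning
 * the first member to LINE aligns the whole struct (and pads it to a
 * multiple of LINE), so each array element starts on its own line. */
struct particle {
    alignas(LINE) float x;
    float y, z;       /* position, updated every step */
    float vx, vy, vz; /* velocity, updated every step */
};

_Static_assert(sizeof(struct particle) % LINE == 0,
               "particle should occupy whole cache lines");
```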
Is aligning data to byte boundaries the only thing that can be done to help this process?
No, the other half is avoiding filling the cache with extraneous data. If some fields are used only by a different algorithm, they waste cache space while the main algorithm is running. But you can't optimize everything all the time, and reorganizing data structures takes programming effort.
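A hypothetical sketch of that kind of reorganization, often called hot/cold splitting (the types and field names are invented for illustration):

```c
/* Before: the physics loop drags editor-only fields through the
 * cache even though it never reads them. */
struct entity {
    float x, y;       /* read every frame by the physics loop */
    char  name[48];   /* used only by the level editor */
    long  created_at; /* used only by the level editor */
};

/* After: the hot data is dense, so each cache line holds more
 * entities; the cold fields live elsewhere, indexed the same way. */
struct entity_hot  { float x, y; };
struct entity_cold { char name[48]; long created_at; };
```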
What can a programmer do to minimize cache misses?
Profile with real-world data and treat excessive cache misses as a bug.
What profiling tools are available to help visualize the optimization process for Windows and Linux platforms?
Cachegrind is very good, but it runs your program on a simulated CPU rather than the real hardware. Intel VTune uses your real hardware, for better or worse. I haven't used the latter.
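A typical Cachegrind session looks like this (the program name is a placeholder, and `<pid>` is filled in by Valgrind's output file naming):

```
valgrind --tool=cachegrind ./your_program
cg_annotate cachegrind.out.<pid>
```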