I'm not sure I understand your problem well enough to offer a solution, but based on what you've described I can offer some alternative viewpoints that may help you.
I program in C, so what works for me may not be applicable in your case.
Your processors have 12MB of L3 and 6MB of L2 cache, which is a lot, but in my experience caches are rarely large enough!
You are probably using rdtsc to time individual sections of code. When I use it, I have a statistics structure into which I feed the measurements from the different parts of the executing code. The average, minimum, maximum and number of observations are obvious, but the standard deviation also has its place, since it can help you decide whether a large maximum value is worth investigating or not. The standard deviation only needs to be computed when it is read out: until then it can be kept in its components (n, sum x, sum x^2). Unless you are timing very short sequences, you can omit the preceding synchronizing instruction. Make sure you measure the overhead of the timing itself, if only to be able to rule it out as insignificant.
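To make that concrete, here is a minimal sketch of such a statistics structure. The names are mine, and __rdtsc comes from <intrin.h> on MSVC or <x86intrin.h> on GCC/clang, so adapt it to your own setup:

#include <stdint.h>
#include <math.h>
#include <intrin.h>                   /* MSVC; use <x86intrin.h> with GCC/clang */

typedef struct {
    uint64_t n;                       /* number of observations */
    double   sum_x;                   /* sum of samples */
    double   sum_x2;                  /* sum of squared samples */
    uint64_t min, max;                /* extremes worth (maybe) investigating */
} TIMESTATS;

static inline void stats_add (TIMESTATS *s, uint64_t cycles)
{
    s->n++;
    s->sum_x  += (double) cycles;
    s->sum_x2 += (double) cycles * (double) cycles;
    if (s->n == 1 || cycles < s->min) s->min = cycles;
    if (cycles > s->max) s->max = cycles;
}

/* the standard deviation is only computed when it is read */
static inline double stats_stddev (const TIMESTATS *s)
{
    double mean = s->sum_x / (double) s->n;
    return sqrt (s->sum_x2 / (double) s->n - mean * mean);
}

/* usage: time a section, and also measure the timing overhead itself */
void example (TIMESTATS *section, TIMESTATS *overhead)
{
    uint64_t t0 = __rdtsc ();
    uint64_t t1 = __rdtsc ();
    stats_add (overhead, t1 - t0);    /* cost of two back-to-back reads */

    t0 = __rdtsc ();
    /* ... code being measured ... */
    t1 = __rdtsc ();
    stats_add (section, t1 - t0);
}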
When I program multi-threaded code, I try to make each core's/thread's job as "memory lean" as possible. By memory lean I mean not doing things that require unnecessary memory accesses. Avoiding unnecessary memory accesses usually means as much inlined code as possible and as few calls into the OS as possible. To me the OS is a big unknown in terms of how much memory work a call into it will generate, so I try to keep my use of it to a minimum. In the same manner, though usually with a smaller performance impact, I try to avoid calling application functions: if they must be called, I'd rather they didn't call a lot of other things in turn.
In the same manner I minimize memory allocations: if I need several, I lump them together into one and then subdivide that one big allocation into the smaller ones. This helps later allocations in that they will have to traverse fewer blocks before finding the block to be returned. I only initialize blocks when it is absolutely necessary.
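A rough sketch of what I mean by lumping allocations together; the WORKSPACE name, members and sizes are only for illustration:

#include <stdlib.h>

typedef struct {
    double *prices;                      /* needs n_prices doubles */
    int    *volumes;                     /* needs n_volumes ints */
    char   *names;                       /* needs n_chars characters */
    void   *base;                        /* the single block to free */
} WORKSPACE;

int workspace_create (WORKSPACE *w, size_t n_prices, size_t n_volumes, size_t n_chars)
{
    size_t bytes = n_prices * sizeof (double)
                 + n_volumes * sizeof (int)
                 + n_chars;
    char *p = malloc (bytes);            /* one allocation instead of three */
    if (p == NULL) return 0;

    w->base    = p;
    w->prices  = (double *) p;  p += n_prices * sizeof (double);
    w->volumes = (int *) p;     p += n_volumes * sizeof (int);
    w->names   = p;
    /* initialize the sub-blocks only if actually required */
    return 1;
}

void workspace_destroy (WORKSPACE *w)
{
    free (w->base);                      /* one block to return, fewer for the allocator to traverse */
}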
I also try to keep code size down by inlining. When moving/setting small blocks of memory I prefer to use intrinsics based on rep movsb and rep stosb rather than calling memcpy/memset, which are usually optimized for large blocks and are not particularly small in code size themselves.
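For example, assuming the MSVC-style intrinsics __movsb/__stosb from <intrin.h> (with GCC you would reach for inline assembly instead), small copies and fills can look like this:

#include <stddef.h>
#include <intrin.h>

static inline void copy_small (void *dst, const void *src, size_t n)
{
    __movsb ((unsigned char *) dst, (const unsigned char *) src, n);   /* rep movsb */
}

static inline void zero_small (void *dst, size_t n)
{
    __stosb ((unsigned char *) dst, 0, n);                             /* rep stosb */
}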
I only recently started using spinlocks, but I implement them so that they become inline (anything is better than calling the OS!). The OS alternative would, I guess, be critical sections, and although those are fast, locally implemented spinlocks are faster. Since they busy-wait, they do of course keep the thread from doing other application work in the meantime. This is the implementation:
inline void spinlock_init (SPINLOCK *slp)
{
    slp->lock_part=0;                               /* lock starts out free */
}

inline char spinlock_failed (SPINLOCK *slp)
{
    return (char) __xchg (&slp->lock_part,1);       /* non-zero: someone already holds it */
}
Or a slightly more complex variant (but not by much):
inline char spinlock_failed (SPINLOCK *slp)
{
    if (__xchg (&slp->lock_part,1)==1)
        return 1;                                   /* already held */
    slp->count_part=1;                              /* we hold it, nesting depth 1 */
    return 0;
}
And to release:
inline void spinlock_leave (SPINLOCK *slp)
{
    slp->lock_part=0;
}
or
inline void spinlock_leave (SPINLOCK *slp)
{
    if (slp->count_part==0)
        __breakpoint ();                            /* releasing a lock we don't hold */
    if (--slp->count_part==0)
        slp->lock_part=0;                           /* last nesting level: free the lock */
}
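For completeness, a possible layout for the SPINLOCK structure and a typical usage pattern. The structure isn't shown above, so take this as an assumption, and __xchg is presumed to be an atomic exchange (the xchg instruction, or something like _InterlockedExchange):

typedef struct {
    volatile long lock_part;             /* 0 = free, 1 = taken */
    long          count_part;            /* nesting depth (second variant only) */
} SPINLOCK;

/* typical usage: spin until acquired, touch the shared data, release */
void update_shared_state (SPINLOCK *slp)
{
    while (spinlock_failed (slp))
        ;                                /* busy-wait; a pause instruction could go here */
    /* ... work on the shared data ... */
    spinlock_leave (slp);
}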
The count part is something I've brought along from embedded (and other) programming, where it is used for handling nested interrupts.
I'm also a great fan of IOCPs for their efficiency in handling events and I/O, but your description does not indicate whether your application could make use of them. In any case, you appear to be economizing on them, which is good.
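If you haven't worked with them, a bare-bones IOCP skeleton looks something like this: just the port creation and one worker-style dequeue, nothing specific to your application.

#include <windows.h>
#include <stdio.h>

int main (void)
{
    /* INVALID_HANDLE_VALUE + no existing port = create a new, empty completion port */
    HANDLE iocp = CreateIoCompletionPort (INVALID_HANDLE_VALUE, NULL, 0, 1);
    if (iocp == NULL) return 1;

    /* sockets/files would be associated with the port via further
       CreateIoCompletionPort calls; here we simply post one completion */
    PostQueuedCompletionStatus (iocp, 42, (ULONG_PTR) 7, NULL);

    DWORD bytes;
    ULONG_PTR key;
    OVERLAPPED *ov;
    if (GetQueuedCompletionStatus (iocp, &bytes, &key, &ov, INFINITE))
        printf ("completion: key=%lu bytes=%lu\n", (unsigned long) key, (unsigned long) bytes);

    CloseHandle (iocp);
    return 0;
}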