DataState class:
I expected this to be a stack or a queue, but it isn't, so push/pull don't seem like good names for the methods. (Either that, or the implementation is completely bogus.)
It's just a latch that lets you read the last event that any thread stored.
Nothing stops two writes in a row from overwriting an element that was never read. Nothing stops you from reading the same element twice, either.
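What the class actually provides looks roughly like this (a hypothetical sketch of the semantics described above, not the author's code; Event is a stand-in type):

```cpp
#include <atomic>

struct Event { int payload; };      // stand-in for a small, trivially copyable event

std::atomic<Event> last_event;      // the whole structure is effectively one latch

void publish(const Event &e) { last_event.store(e); }      // may clobber an unread event
Event read_last()            { return last_event.load(); } // may return the same event twice
```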
If all you need is small blocks of data to copy from, a ring buffer looks like a decent approach. But if you don't want to lose events, I don't think you can use it this way. Instead, just claim a ring-buffer entry, then copy into it and use it in place, as sketched below. That way a single atomic operation claims the entry by advancing the buffer's position index.
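A hypothetical call site (ring and incoming_event are my names, not from the original code):

```cpp
// One atomic RMW claims the slot; the event is then built in place,
// with no separate staging copy.
Event *slot = ring.get_next();   // single atomic increment claims an entry
*slot = incoming_event;          // fill the claimed entry in place
```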
Ring buffer:
You can make get_next() much more efficient. This line does an atomic post-increment (fetch_add) and then an atomic exchange:
```cpp
return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
```
I'm not even sure this is safe, because the exchange can step on the fetch_add from another thread. In any case, even if it is safe, it's not ideal.
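Spelled out, the one-liner is two separate atomic RMW operations with a window between them (the variable names are mine):

```cpp
std::size_t claimed = arena_idx++;                            // atomic RMW #1: fetch_add
std::size_t prev = arena_idx.exchange(claimed % arena_size);  // atomic RMW #2: exchange
return &arena[prev];  // another thread's increment between #1 and #2 is lost
```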
You don't need the exchange at all. Make sure arena_size is always a power of 2, and then you don't need to wrap the shared counter. You can just let it run free, and have each thread take it modulo the size for its own use. It will eventually wrap around, but it's a binary integer, so it wraps at a power of 2, which is a multiple of your arena size. (For example, a 64-bit counter wraps at 2^64, and 2^64 is an exact multiple of any power-of-2 size, so the index sequence stays consistent across the wrap.)
I'd suggest storing a mask (size - 1) instead of the size, so there's no risk of the % compiling to anything other than an AND instruction, even though the size isn't a compile-time constant. This avoids a 64-bit integer division instruction.
```cpp
template<typename T>
class RingAllocator {
    T *arena;
    std::atomic_size_t arena_idx;
    const std::size_t size_mask;   // size - 1, where size is a power of 2
```
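A sketch of what the mask-based get_next() could then look like (my code, not the original, assuming the members above):

```cpp
T *get_next() {
    // One atomic RMW claims an index; the free-running counter wraps
    // at a power of 2, which is a multiple of the arena size.
    std::size_t idx = arena_idx.fetch_add(1, std::memory_order_relaxed);
    return &arena[idx & size_mask];   // & compiles to a single AND, no division
}
```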
Arena allocation would be more efficient if you used calloc instead of new + memset. The OS already zeroes pages before handing them to user-space processes (to prevent information leaking between processes), so writing zeros over them all again is just wasted work.
```cpp
arena = new T[size];
std::memset(arena, 0, sizeof(T) * size);
```
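A sketch of the calloc version (needs &lt;cstdlib&gt;; it assumes T is trivial, since std::calloc runs no constructors):

```cpp
// The pages arrive from the OS already zeroed, so no memset is needed.
arena = static_cast<T*>(std::calloc(size, sizeof(T)));
```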
Writing the pages yourself does dirty them, though, so they get wired to real physical pages instead of remaining copy-on-write mappings of a system-wide shared physical zero page (which is what you have right after new/malloc/calloc). On a NUMA system, the physical page chosen can depend on which thread actually touched the page, not which thread did the allocation. And since you're reusing the pool, the first core to write a page may not be the one that ends up using it the most.
Might be something to look for with microbenchmarks / perf counters.