Why is a struct with extra padding fields faster?

I just found a library that provides a lock-free ring buffer that is faster than channels: https://github.com/textnode/gringo (and it works very fast, especially with GOMAXPROCS > 1).

But the interesting part is the struct used to manage the queue state:

type Gringo struct {
    padding1           [8]uint64
    lastCommittedIndex uint64
    padding2           [8]uint64
    nextFreeIndex      uint64
    padding3           [8]uint64
    readerIndex        uint64
    padding4           [8]uint64
    contents           [queueSize]Payload
    padding5           [8]uint64
}

If I delete the paddingX [8]uint64 fields, it runs about 20% slower. How can that be?

I would also appreciate it if someone could explain why this non-blocking algorithm is so much faster than channels, even buffered ones.

+8
concurrency parallel-processing go
2 answers

The padding eliminates false sharing by putting each field on its own cache line. If two variables share a cache line, a read of an otherwise-unchanged variable becomes as expensive as a read of a modified one whenever there is an intervening write to the other variable.

When a variable is read by multiple cores and not modified, the cache line is shared between the cores. That makes reads very cheap. Before any core can write to any part of that cache line, it must first invalidate the line in the other cores' caches. If one of those cores later reads from the line, it finds the line invalid and has to fetch it again. This causes lots of extra cache-coherency traffic when one variable on the line is frequently written while another is frequently read.
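A minimal sketch of this effect (not from gringo; the struct names, counter fields, and iteration count are my own). Two goroutines each increment their own counter; the only difference between the two runs is whether the counters share a cache line. On a multi-core machine, the padded version typically runs noticeably faster:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const iters = 20_000_000

// shared: a and b sit next to each other, almost certainly on the
// same 64-byte cache line, so two cores updating them concurrently
// keep invalidating each other's caches (false sharing).
type shared struct {
	a uint64
	b uint64
}

// separated: 56 bytes of padding push b onto a different cache line
// than a, mirroring gringo's paddingX fields.
type separated struct {
	a uint64
	_ [7]uint64
	b uint64
}

// bump runs two goroutines that increment two independent counters.
// Neither goroutine ever touches the other's counter, so the only
// interaction between them is through the cache-coherency protocol.
func bump(a, b *uint64) time.Duration {
	var wg sync.WaitGroup
	wg.Add(2)
	start := time.Now()
	go func() {
		defer wg.Done()
		for i := 0; i < iters; i++ {
			*a++
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < iters; i++ {
			*b++
		}
	}()
	wg.Wait()
	return time.Since(start)
}

func main() {
	var s shared
	var p separated
	fmt.Println("same cache line:     ", bump(&s.a, &s.b))
	fmt.Println("separate cache lines:", bump(&p.a, &p.b))
}
```

The exact speedup depends on the CPU and its cache-line size (64 bytes is assumed here), but the direction is consistent: the padded layout avoids the ping-ponging of the shared line between cores.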

+11

It works faster because it does not require locks. There is a Java implementation of the same idea (called the Disruptor) that works very well and seems to be the inspiration for gringo. They explain the cost of locks and how to increase throughput here.
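A rough illustration of the lock cost argument (my own micro-example, not a rigorous benchmark and not gringo's code): incrementing a counter under a mutex pays the lock/unlock overhead on every operation, while an atomic increment is a single hardware instruction.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// countMutex increments a counter n times under a sync.Mutex,
// paying the lock/unlock cost on every iteration.
func countMutex(n int) uint64 {
	var mu sync.Mutex
	var c uint64
	for i := 0; i < n; i++ {
		mu.Lock()
		c++
		mu.Unlock()
	}
	return c
}

// countAtomic increments a counter n times with an atomic add,
// one CPU instruction per iteration and no lock bookkeeping.
func countAtomic(n int) uint64 {
	var c uint64
	for i := 0; i < n; i++ {
		atomic.AddUint64(&c, 1)
	}
	return c
}

func main() {
	const n = 10_000_000

	start := time.Now()
	countMutex(n)
	fmt.Println("mutex: ", time.Since(start))

	start = time.Now()
	countAtomic(n)
	fmt.Println("atomic:", time.Since(start))
}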

Regarding the padding, the document also alludes to some of the reasons. Basically: CPU caches. This document explains it well. You can get a huge performance increase when the processor can access its level 1 cache instead of going to main memory or the outer caches. But this requires extra care, because the processor fully loads cache lines and has to reload them (from memory or the level 2-3 caches) whenever they are invalidated. With a concurrent data structure, as @David Schwartz said, false sharing will force the processor to reload cache lines more often, because other data residing on the same cache line may be modified by another core, invalidating the whole line.
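To show how such a queue can work without locks, here is a heavily simplified single-producer/single-consumer ring buffer sketch in the Disruptor style. This is my own illustration, not gringo's actual implementation (gringo supports multiple producers via its lastCommittedIndex/nextFreeIndex pair; this sketch collapses that to one write index):

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

const size = 1024 // power of two, so index & (size-1) replaces modulo

// spscQueue is a simplified single-producer/single-consumer ring.
// Each side owns one index; the atomics publish progress to the
// other side, so no mutex is ever taken.
type spscQueue struct {
	writeIndex uint64    // advanced only by the producer
	_          [7]uint64 // padding: keep the indices on separate cache lines
	readIndex  uint64    // advanced only by the consumer
	_          [7]uint64
	contents   [size]uint64
}

// Put spins until a slot is free, writes the value, then publishes
// the new write index with an atomic store.
func (q *spscQueue) Put(v uint64) {
	w := atomic.LoadUint64(&q.writeIndex)
	for w-atomic.LoadUint64(&q.readIndex) >= size {
		runtime.Gosched() // queue full: busy-wait instead of blocking
	}
	q.contents[w&(size-1)] = v
	atomic.StoreUint64(&q.writeIndex, w+1) // publish to the consumer
}

// Get spins until a value is available, reads it, then publishes
// the new read index so the producer can reuse the slot.
func (q *spscQueue) Get() uint64 {
	r := atomic.LoadUint64(&q.readIndex)
	for atomic.LoadUint64(&q.writeIndex) == r {
		runtime.Gosched() // queue empty: busy-wait
	}
	v := q.contents[r&(size-1)]
	atomic.StoreUint64(&q.readIndex, r+1)
	return v
}

func main() {
	q := &spscQueue{}
	done := make(chan struct{})
	go func() {
		for i := uint64(0); i < 10000; i++ {
			if q.Get() != i {
				panic("out of order")
			}
		}
		close(done)
	}()
	for i := uint64(0); i < 10000; i++ {
		q.Put(i)
	}
	<-done
	fmt.Println("ok")
}
```

The key point is that a send or receive never parks the goroutine or touches a mutex; the cost per operation is a couple of atomic loads and one atomic store, which is why padding the indices apart (as gringo does) matters so much.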

+3
