It depends on many things, but above all on:
- the OS
- the malloc implementation that you use
The OS is responsible for allocating "virtual memory", which your process has access to, and maintains a translation table that maps virtual addresses to actual physical memory addresses.
Now, the standard malloc implementation is usually conservative and will simply put one giant lock around everything. This means that requests are processed sequentially, and the only thing that using several threads instead of one achieves is to slow things down.
There are smarter allocation schemes, usually based on pools, which can be found in other malloc implementations: tcmalloc (from Google) and jemalloc (used by Facebook) are two such implementations designed for high performance in multi-threaded applications.
However, there is no silver bullet: at some point, the OS must perform the virtual <=> physical translation, which requires some form of locking.
My best advice would be to allocate in arenas:
- Allocate big chunks (arenas) up front
- Slice them up into blocks of the appropriate size
There is no need to parallelize the arena allocation itself; it is better to request the biggest arenas you can (keeping in mind that requests for very large amounts may fail), and then the slicing can be parallelized.
tcmalloc and jemalloc may help a little; however, they are not designed for large allocations (which is an unusual workload), and I do not know whether the size of the arenas they request can be tuned.