Do malloc() / memcpy() run independently per core on NUMA?

When trying to speed up my applications on standard (non-NUMA) machines, I have always found that the call to malloc() was a bottleneck, because even on multi-core machines it is shared / synchronized between all the cores.

I now have a PC with a NUMA architecture, running Linux and programming in C, and I have two questions:

  • On a NUMA machine, since each core has its own local memory, will malloc() execute independently on each core / memory node, without blocking the other cores?
  • How are memcpy() calls handled on these architectures? Can memcpy() be called independently on each core, or does calling it on one core block the others? I may be mistaken, but I remember that memcpy() had the same problem as malloc(): when one core uses it, the rest have to wait.
2 answers

NUMA is a shared-memory system, so a memory access from any processor can reach any memory without blocking. If the memory model were message-based, accessing remote memory would require the issuing processor to ask the processor local to that memory to perform the operation on its behalf. In a NUMA system, however, a remote processor making memory references can still affect the performance of the processor local to that memory, because both contend for the same memory controller, although the impact depends on the particular architectural configuration.
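
To make the "single shared address space split across nodes" picture concrete, here is a minimal sketch (my addition, assuming libnuma is installed; link with -lnuma) that queries the NUMA topology from C. It only uses standard libnuma calls; the point is that every node's memory is visible to the calling thread, it is just closer to some CPUs than to others.

    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "this kernel/library has no NUMA support\n");
            return 1;
        }

        int cpu = sched_getcpu();
        printf("running on CPU %d (node %d)\n", cpu, numa_node_of_cpu(cpu));

        /* Walk all configured nodes and report their memory sizes. */
        for (int node = 0; node <= numa_max_node(); node++) {
            long long freemem = 0;
            long long total = numa_node_size64(node, &freemem);
            printf("node %d: %lld MB total, %lld MB free\n",
                   node, total >> 20, freemem >> 20);
        }
        return 0;
    }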

As for 1, it depends entirely on the OS and the malloc implementation. The OS is responsible for presenting the memory attached to each processor either as a single flat address space or as NUMA nodes. malloc may or may not be NUMA-aware. More fundamentally, a given malloc implementation may or may not be able to serve requests from multiple threads concurrently. The answer from Al (and the related discussion) addresses this issue in more detail.
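
If you do not want to rely on malloc being NUMA-aware, you can allocate on a specific node explicitly. Below is a minimal sketch (my addition, not part of the original answer) using libnuma, which assumes the library is installed and the program is linked with -lnuma; numa_alloc_local() and numa_alloc_onnode() are real libnuma calls.

    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        size_t size = 64 * 1024 * 1024;

        /* Allocate on the node the calling thread is currently running on. */
        void *local = numa_alloc_local(size);

        /* Allocate explicitly on node 0, wherever the caller runs. */
        void *on_node0 = numa_alloc_onnode(size, 0);

        if (!local || !on_node0) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }

        /* Touching the pages actually commits them to the chosen nodes. */
        memset(local, 0, size);
        memset(on_node0, 0, size);

        numa_free(local, size);
        numa_free(on_node0, size);
        return 0;
    }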

As for 2, since memcpy consists of a series of ordinary loads and stores, it takes no lock; the only effect is again the potential architectural cost of going through the memory controllers of other processors' nodes, etc. (see the sketch below).
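
A minimal sketch (my addition, under the assumption above that memcpy is just loads and stores): each thread copies between its own private buffers, and the threads do not serialize against each other. Compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NTHREADS 4
    #define BUFSIZE  (16 * 1024 * 1024)

    static void *worker(void *arg)
    {
        (void)arg;
        char *src = malloc(BUFSIZE);
        char *dst = malloc(BUFSIZE);
        if (!src || !dst)
            return NULL;

        memset(src, 'x', BUFSIZE);

        /* Each thread's memcpy is just loads and stores on its own pages;
         * there is no global lock shared between the threads. */
        for (int i = 0; i < 100; i++)
            memcpy(dst, src, BUFSIZE);

        free(src);
        free(dst);
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tids[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tids[i], NULL);
        puts("done");
        return 0;
    }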

  • Calls to malloc in separate processes are independent regardless of whether you are on a NUMA architecture. Calls to malloc from different threads of the same process cannot be fully independent, because the returned memory has to be equally accessible to all threads within the process. If you want memory to be local to a specific thread, look at thread-local storage (see the sketch after this answer). I could not find clear documentation on whether the Linux virtual memory subsystem and scheduler optimize the locality between cores, threads, local memory, and thread-local storage.
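
A minimal sketch (my addition) of thread-local storage in C: each thread keeps its own buffer behind a thread-local pointer, and because of Linux's default first-touch policy the pages it touches are normally placed on the node where that thread is running. __thread is the GCC/Clang spelling; C11 spells it _Thread_local. Compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BUFSIZE (4 * 1024 * 1024)

    /* One pointer per thread. */
    static __thread char *tls_buf;

    static void *worker(void *arg)
    {
        long id = (long)arg;

        tls_buf = malloc(BUFSIZE);
        if (!tls_buf)
            return NULL;

        /* First touch: these writes fault the pages in, normally on the
         * NUMA node where this thread is currently running. */
        memset(tls_buf, 0, BUFSIZE);

        printf("thread %ld has its own buffer at %p\n", id, (void *)tls_buf);
        free(tls_buf);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (long i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }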
