The numa_alloc_*() functions in libnuma allocate whole pages of memory, typically 4096 bytes. Cache lines are typically 64 bytes. Since 4096 is a multiple of 64, anything returned from numa_alloc_*() will already be aligned at the cache-line level.
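For illustration only, here is a minimal sketch of that alignment claim; the 4096-byte size and the use of node 0 are arbitrary choices, not anything prescribed by libnuma:

// Minimal sketch (not from the answers above): check that a page-sized
// numa_alloc_onnode() allocation lands on a cache-line boundary.
// Compile with -lnuma.
#include <numa.h>
#include <stdio.h>

int main(void)
{
  if (numa_available() < 0)
    return 1;                                  // no NUMA support available

  void *p = numa_alloc_onnode(4096, 0);        // one page on node 0 (arbitrary)
  if (p == NULL)
    return 1;

  printf("64-byte aligned: %s\n", ((unsigned long)p % 64 == 0) ? "yes" : "no");
  numa_free(p, 4096);
  return 0;
}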
Beware the numa_alloc_*() functions. The man page says they are slower than the corresponding malloc(), which I'm sure is true, but the much bigger problem I found is that simultaneous allocations from numa_alloc_*() running on many cores at once suffer severe contention problems. In my case, replacing malloc() with numa_alloc_onnode() was a wash (everything I gained by using local memory was offset by increased allocation/free time); tcmalloc was faster than either. I was performing thousands of 12-16kb mallocs on 32 threads/cores at once. Timing experiments showed that it wasn't numa_alloc_onnode()'s single-threaded speed that was responsible for the large amount of time my process spent performing the allocations, which left lock/contention issues as the likely cause.

The solution I've adopted is to numa_alloc_onnode() large chunks of memory once, and then dole it out to the threads running on each node as needed. I use the gcc atomic builtins to allow each thread (I pin threads to cpus) to grab from the memory allocated on its node (see the sketch below). You can cache-line-size-align the allocations as they are made, if you like: I do. This approach beats the pants off even tcmalloc (which is thread-aware but not NUMA-aware - at least the Debian Squeeze version doesn't seem to be). The downside is that you can't free individual allocations (well, not without a lot more work, anyway); you can only free the whole underlying on-node allocations. However, if this is temporary on-node scratch space for a function call, or you can otherwise specify exactly when that memory is no longer needed, then this approach works very well. It helps if you can predict how much memory you need to allocate on each node too, obviously.
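As a rough sketch of the pinning step mentioned above (the answer only says threads are pinned to cpus; binding each thread to a node with numa_run_on_node() is a stand-in, not the author's code), each worker could restrict itself to its node before drawing from that node's pre-allocated chunk:

// Hedged sketch, assuming one worker thread per configured node.
// numa_run_on_node() binds the calling thread to the cpus of that node.
// Compile with -lnuma -lpthread.
#include <numa.h>
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
  int node = *(int *)arg;
  if (numa_run_on_node(node) != 0)   // restrict this thread to cpus on 'node'
    perror("numa_run_on_node");
  // ... draw from the chunk pre-allocated on 'node', e.g. via my_malloc() ...
  return NULL;
}

int main(void)
{
  if (numa_available() < 0)
    return 1;

  int nnodes = numa_num_configured_nodes();
  pthread_t tid[nnodes];
  int nodes[nnodes];

  for (int i = 0; i < nnodes; i++) {
    nodes[i] = i;
    pthread_create(&tid[i], NULL, worker, &nodes[i]);
  }
  for (int i = 0; i < nnodes; i++)
    pthread_join(tid[i], NULL);
  return 0;
}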
@nandu: I won't post the full source - it's long and in places tied to other things I do, which makes it less than perfectly transparent. What I will post is a slightly shortened version of my new malloc() function, to illustrate the core idea:
void *my_malloc(struct node_memory *nm, int node, long size)
{
  long off, obytes;

  // round up size to the nearest cache line size
  // (optional, though some rounding is essential to avoid misalignment problems)
  if ((obytes = (size % CACHE_LINE_SIZE)) > 0)
    size += CACHE_LINE_SIZE - obytes;

  // atomically increase the offset for the requested node by size
  if (((off = __sync_fetch_and_add(&(nm->off[node]), size)) + size) > nm->bytes) {
    fprintf(stderr, "Out of allocated memory on node %d\n", node);
    return NULL;
  }
  else
    return (void *)(nm->ptr[node] + off);
}
where struct node_memory is
struct node_memory {
  long bytes;   // the number of bytes of memory allocated on each node
  char **ptr;   // array of ptrs to the base of the memory on each node
  long *off;    // array of offsets from those bases (in bytes)
  int nptrs;    // the size of the ptr[] and off[] arrays
};
and nm->ptr[node] is set up using the libnuma numa_alloc_onnode() function.
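That set-up step isn't shown in the answer; a minimal sketch of how it might look (the name my_malloc_init and its error handling are assumptions, not the author's code) is one numa_alloc_onnode() call per node, with all offsets starting at zero:

// Hedged sketch of the initialisation, using struct node_memory as defined above.
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

struct node_memory *my_malloc_init(long bytes_per_node)
{
  if (numa_available() < 0)
    return NULL;                               // no NUMA support on this system

  int nnodes = numa_num_configured_nodes();
  struct node_memory *nm = malloc(sizeof(*nm));
  if (nm == NULL)
    return NULL;

  nm->bytes = bytes_per_node;
  nm->nptrs = nnodes;
  nm->ptr = malloc(nnodes * sizeof(char *));
  nm->off = calloc(nnodes, sizeof(long));      // all offsets start at 0

  for (int node = 0; node < nnodes; node++) {
    nm->ptr[node] = numa_alloc_onnode(bytes_per_node, node);
    if (nm->ptr[node] == NULL) {
      fprintf(stderr, "numa_alloc_onnode failed on node %d\n", node);
      return NULL;
    }
  }
  return nm;
}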
Generally I also keep valid node information in the structure, so my_malloc() can check that node requests are sensible without making function calls; I also check that nm exists and that size is sensible. The function __sync_fetch_and_add() is a gcc builtin atomic; if you're not compiling with gcc, you'll need something else. I use atomics because in my limited experience they are much faster than mutexes under high thread/core-count conditions (as on 4P NUMA machines).
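To round the sketch out (again, the name my_malloc_release is an assumption), releasing the whole on-node allocations - the only kind of free this scheme supports - would look something like:

// Hedged sketch: free each node's chunk with numa_free(), then the bookkeeping.
#include <numa.h>
#include <stdlib.h>

void my_malloc_release(struct node_memory *nm)
{
  for (int node = 0; node < nm->nptrs; node++)
    numa_free(nm->ptr[node], nm->bytes);       // release that node's whole chunk
  free(nm->ptr);
  free(nm->off);
  free(nm);
}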