The numa_alloc_*() functions in libnuma allocate whole pages of memory, typically 4096 bytes. Cache lines are typically 64 bytes. Since 4096 is a multiple of 64, anything returned from numa_alloc_*() will already be aligned at the cache-line level.
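For illustration only, here is a minimal sketch of that alignment claim; the 4096-byte size and the use of node 0 are arbitrary choices, not anything prescribed by libnuma:

// Minimal sketch (not from the answers above): check that a page-sized
// numa_alloc_onnode() allocation lands on a cache-line boundary.
// Compile with -lnuma.
#include <numa.h>
#include <stdio.h>

int main(void)
{
  if (numa_available() < 0)
    return 1;                                  // no NUMA support available

  void *p = numa_alloc_onnode(4096, 0);        // one page on node 0 (arbitrary)
  if (p == NULL)
    return 1;

  printf("64-byte aligned: %s\n", ((unsigned long)p % 64 == 0) ? "yes" : "no");
  numa_free(p, 4096);
  return 0;
}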
Beware the numa_alloc_*() functions. The man page says they are slower than the corresponding malloc(), which I'm sure is true, but the much bigger problem I found is that simultaneous allocations from numa_alloc_*() running on many cores at once suffer severe contention problems. In my case, replacing malloc() with numa_alloc_onnode() was a wash (everything I gained by using local memory was offset by increased allocation/free time); tcmalloc was faster than either. I was performing thousands of 12-16kb mallocs on 32 threads/cores at once. Timing experiments showed that it wasn't numa_alloc_onnode()'s single-threaded speed that was responsible for the large amount of time my process spent performing the allocations, which left lock/contention issues as the likely cause.

The solution I've adopted is to numa_alloc_onnode() large chunks of memory once, and then dole it out to the threads running on each node as needed. I use the gcc atomic builtins to allow each thread (I pin threads to cpus) to grab from the memory allocated on its node (see the sketch below). You can cache-line-size-align the allocations as they are made, if you like: I do. This approach beats the pants off even tcmalloc (which is thread-aware but not NUMA-aware - at least the Debian Squeeze version doesn't seem to be). The downside is that you can't free individual allocations (well, not without a lot more work, anyway); you can only free the whole underlying on-node allocations. However, if this is temporary on-node scratch space for a function call, or you can otherwise specify exactly when that memory is no longer needed, then this approach works very well. It helps if you can predict how much memory you need to allocate on each node too, obviously.
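As a rough sketch of the pinning step mentioned above (the answer only says threads are pinned to cpus; binding each thread to a node with numa_run_on_node() is a stand-in, not the author's code), each worker could restrict itself to its node before drawing from that node's pre-allocated chunk:

// Hedged sketch, assuming one worker thread per configured node.
// numa_run_on_node() binds the calling thread to the cpus of that node.
// Compile with -lnuma -lpthread.
#include <numa.h>
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
  int node = *(int *)arg;
  if (numa_run_on_node(node) != 0)   // restrict this thread to cpus on 'node'
    perror("numa_run_on_node");
  // ... draw from the chunk pre-allocated on 'node', e.g. via my_malloc() ...
  return NULL;
}

int main(void)
{
  if (numa_available() < 0)
    return 1;

  int nnodes = numa_num_configured_nodes();
  pthread_t tid[nnodes];
  int nodes[nnodes];

  for (int i = 0; i < nnodes; i++) {
    nodes[i] = i;
    pthread_create(&tid[i], NULL, worker, &nodes[i]);
  }
  for (int i = 0; i < nnodes; i++)
    pthread_join(tid[i], NULL);
  return 0;
}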
@nandu: I won't post the full source - it's long and in places tied to other things I do, which makes it less than perfectly transparent. What I will post is a slightly shortened version of my new malloc() function, to illustrate the core idea:
void *my_malloc(struct node_memory *nm, int node, long size)
{
  long off, obytes;

  // round up size to the nearest cache line size
  // (optional, though some rounding is essential to avoid misalignment problems)
  if ((obytes = (size % CACHE_LINE_SIZE)) > 0)
    size += CACHE_LINE_SIZE - obytes;

  // atomically increase the offset for the requested node by size
  if (((off = __sync_fetch_and_add(&(nm->off[node]), size)) + size) > nm->bytes) {
    fprintf(stderr, "Out of allocated memory on node %d\n", node);
    return NULL;
  }
  else
    return (void *)(nm->ptr[node] + off);
}
where struct node_memory is
struct node_memory {
  long bytes;   // the number of bytes of memory allocated on each node
  char **ptr;   // array of ptrs to the base of the memory on each node
  long *off;    // array of offsets from those bases (in bytes)
  int nptrs;    // the size of the ptr[] and off[] arrays
};
and nm->ptr[node] is set up using the libnuma numa_alloc_onnode() function.
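That set-up step isn't shown in the answer; a minimal sketch of how it might look (the name my_malloc_init and its error handling are assumptions, not the author's code) is one numa_alloc_onnode() call per node, with all offsets starting at zero:

// Hedged sketch of the initialisation, using struct node_memory as defined above.
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

struct node_memory *my_malloc_init(long bytes_per_node)
{
  if (numa_available() < 0)
    return NULL;                               // no NUMA support on this system

  int nnodes = numa_num_configured_nodes();
  struct node_memory *nm = malloc(sizeof(*nm));
  if (nm == NULL)
    return NULL;

  nm->bytes = bytes_per_node;
  nm->nptrs = nnodes;
  nm->ptr = malloc(nnodes * sizeof(char *));
  nm->off = calloc(nnodes, sizeof(long));      // all offsets start at 0

  for (int node = 0; node < nnodes; node++) {
    nm->ptr[node] = numa_alloc_onnode(bytes_per_node, node);
    if (nm->ptr[node] == NULL) {
      fprintf(stderr, "numa_alloc_onnode failed on node %d\n", node);
      return NULL;
    }
  }
  return nm;
}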
Generally I also keep valid node information in the structure, so my_malloc() can check that node requests are sensible without making function calls; I also check that nm exists and that size is sensible. The function __sync_fetch_and_add() is a gcc builtin atomic; if you're not compiling with gcc, you'll need something else. I use atomics because in my limited experience they are much faster than mutexes under high thread/core-count conditions (as on 4P NUMA machines).
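To round the sketch out (again, the name my_malloc_release is an assumption), releasing the whole on-node allocations - the only kind of free this scheme supports - would look something like:

// Hedged sketch: free each node's chunk with numa_free(), then the bookkeeping.
#include <numa.h>
#include <stdlib.h>

void my_malloc_release(struct node_memory *nm)
{
  for (int node = 0; node < nm->nptrs; node++)
    numa_free(nm->ptr[node], nm->bytes);       // release that node's whole chunk
  free(nm->ptr);
  free(nm->off);
  free(nm);
}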