Memory Allocation and NUMA Hardware Access

I am developing a scientific computing tool in Python that should be able to distribute work across multiple cores in a NUMA shared-memory environment, and I am trying to work out the most effective way to do this.

Threads are out of the picture because of Python's Global Interpreter Lock, which leaves multiprocessing as my only option. For inter-process communication I assume my choices are pipes, sockets, or mmap. Please point it out if anything is missing from this list.

My application will require quite a lot of communication between processes, plus access to some amount of common data. My main concern is latency.

My questions are: when I spawn a process, will its memory be placed near the core it is assigned to? Since fork on *nix is copy-on-write, my initial assumption is that it will not be. Can I force a copy, for faster memory access, and if so, how? If I use mmap for communication, can that memory still be distributed across the cores, or will it sit next to just one? Is there a mechanism that transparently moves data to optimize access? Is there a way to get direct control over the physical placement, or a way to query placement information so I can optimize?
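
For concreteness, here is roughly the mmap-based setup I have in mind, as a minimal sketch of my own (it assumes Linux and the fork start method, and the names are purely illustrative):

import mmap
import struct
import multiprocessing as mp

SLOT = struct.Struct("d")   # one double per worker slot

def worker(buf, idx):
    # each worker writes into its own slot of the shared anonymous mapping
    SLOT.pack_into(buf, idx * SLOT.size, float(idx) ** 2)

if __name__ == "__main__":
    mp.set_start_method("fork")               # children inherit the mapping (Linux)
    n = 4
    shared = mmap.mmap(-1, n * SLOT.size)     # anonymous mapping, MAP_SHARED on Unix
    procs = [mp.Process(target=worker, args=(shared, i)) for i in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print([SLOT.unpack_from(shared, i * SLOT.size)[0] for i in range(n)])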

At a higher level, which of these things are determined by my hardware and which by the operating system? I am shopping for a high-end multiprocessor machine and am torn between AMD Opteron and Intel Xeon. What are the implications of the specific hardware for any of the questions above?

+4
2 answers

Since the GIL is one of Python's Achilles' heels, there is better support for multiprocessing than for threading. For example, the multiprocessing module provides queues, pipes, locks, shared values, and shared arrays. There is also something called a Manager, which lets you wrap many Python data structures and share them in an IPC-friendly way. I believe most of these work over pipes or sockets, but I have not dug too deeply into the internals.

http://docs.python.org/2/library/multiprocessing.html
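
For example, a shared Array can hold bulk numeric data while a Queue carries small control messages. A minimal sketch of my own (not taken from the docs above):

import multiprocessing as mp

def worker(shared, lock, q, i):
    with lock:                 # Lock guards the update (Array also has its own internal lock)
        shared[i] = i * i
    q.put(i)                   # Queue as a cheap notification channel

if __name__ == "__main__":
    n = 4
    shared = mp.Array("d", n)  # shared-memory array of doubles, visible to all children
    lock = mp.Lock()
    q = mp.Queue()
    procs = [mp.Process(target=worker, args=(shared, lock, q, i)) for i in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(list(shared), sorted(q.get() for _ in range(n)))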

How does Linux model NUMA systems?

The kernel detects that it is running on a multi-core machine, and then discovers how much hardware there is and what the topology looks like. It then builds a model of that topology using the idea of nodes. A node is a physical socket containing a CPU (possibly with multiple cores) and the memory attached to it. Why a node rather than a core? Because the memory bus is the set of physical wires connecting the RAM to the CPU socket, and all the cores on the processor in that socket have the same access time to all of the RAM sitting on that memory bus.
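
On Linux you can inspect the node model the kernel has built by reading sysfs directly. A small sketch (the paths are standard Linux sysfs; the output obviously depends on the machine):

import glob
import os

# each /sys/devices/system/node/nodeN directory is one NUMA node
for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    with open(os.path.join(node, "cpulist")) as f:
        cpus = f.read().strip()
    print("{}: CPUs {}".format(os.path.basename(node), cpus))

The numactl --hardware command prints the same topology, along with per-node memory sizes and the inter-node distance table.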

How is memory on one memory bus accessed by a core on another memory bus?

On x86 systems, this happens through the caches. Modern operating systems use a piece of hardware called the Translation Lookaside Buffer (TLB) to map virtual addresses to physical addresses. If the memory the cache has been asked to fetch is local, it is read locally. If it is not local, the request goes over the HyperTransport bus on AMD systems, or QuickPath on Intel, to the remote memory. Since this happens at the cache level, you theoretically do not need to know about it, and you certainly have no control over it. But for high-performance applications it is incredibly useful to understand, so that you can minimize the number of remote accesses.

Where does the OS actually place the physical pages of virtual memory?

When a process forks, it inherits all of the parent's pages (thanks to COW). The kernel has an idea of which node is "best" for a process, which is its "preferred" node. This can be changed, but by default it will be the same as the parent's. Memory allocation also defaults to the same node as the process, unless it is explicitly changed.

Is there a transparent process that moves memory?

No. Once memory is allocated, it is committed to the node it was allocated on. You could make a new allocation on another node, move the data over, and free the allocation on the first node, but that is a bit of a pain to do.

Is there a way to control the allocation?

The default is to allocate on the local node. If you use libnuma, you can change the allocation policy (for example to round-robin or interleaved) instead of defaulting to local.
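
Here is a rough sketch of driving libnuma directly from Python with ctypes. It assumes Linux with libnuma installed; error handling is minimal and the buffers are only allocated, not really used:

import ctypes
import ctypes.util

lib = ctypes.CDLL(ctypes.util.find_library("numa") or "libnuma.so.1")
if lib.numa_available() < 0:
    raise RuntimeError("this system is not NUMA-aware")

# declare the signatures we rely on so pointers are not truncated on 64-bit
lib.numa_alloc_onnode.restype = ctypes.c_void_p
lib.numa_alloc_onnode.argtypes = [ctypes.c_size_t, ctypes.c_int]
lib.numa_alloc_interleaved.restype = ctypes.c_void_p
lib.numa_alloc_interleaved.argtypes = [ctypes.c_size_t]
lib.numa_free.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

size = 1 << 20                               # 1 MiB
node = lib.numa_max_node()                   # highest-numbered node, just as a demo
local = lib.numa_alloc_onnode(size, node)    # pages placed on that node
spread = lib.numa_alloc_interleaved(size)    # pages interleaved round-robin over all nodes
# "moving" data means allocating on the target node, copying (e.g. ctypes.memmove),
# and freeing the original allocation
lib.numa_free(local, size)
lib.numa_free(spread, size)

The numactl command-line tool exposes the same policies without code changes, for example: numactl --interleave=all python app.py (app.py being whatever your script is called).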

I got a lot of information from this blog post:

http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/

I would definitely recommend you read it in its entirety to get more information.

+2

## client.py
import socket
import sys

HOST, PORT = "localhost", 9999
data = " ".join(sys.argv[1:])

# open a TCP connection, send the message, and read the echoed reply
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sock.connect((HOST, PORT))
    sock.sendall(bytes(data + "\n", "utf-8"))
    received = str(sock.recv(1024), "utf-8")
finally:
    sock.close()

print("Sent: {}".format(data))
print("Received: {}".format(received))

## server.py
import socketserver

class MyTCPHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # read up to 1 KiB from the client, log it, and echo it back upper-cased
        self.data = self.request.recv(1024).strip()
        print("{} wrote:".format(self.client_address[0]))
        print(self.data)
        self.request.sendall(self.data.upper())

if __name__ == "__main__":
    HOST, PORT = "10.0.0.1", 9999
    server = socketserver.TCPServer((HOST, PORT), MyTCPHandler)
    server.serve_forever()
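
To try this pair on a single machine, you would change HOST in server.py to "localhost" (it is set to 10.0.0.1 above), start the server, and then run the client with the message as arguments, for example: python client.py hello world.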
-2