An IOMMU can be very useful in that it provides a set of mapping registers. It can arrange for any physical memory to appear within the address range accessible to a device, and it can make physically scattered buffers look contiguous to devices. This is not good for third-party PCI/PCI-Express cards or remote machines attempting to access the raw physical offset of an nVidia GPU, as it may result in not actually reaching the intended regions of memory, or in the IOMMU unit inhibiting or restricting such accesses on a per-card basis. This must be disabled, because
"RDMA for GPUDirect currently relies on all physical addresses that are the same in terms of PCI devices .
-nVidia, Design Considerations for RDMA and GPUDirect
When drivers attempt to use the CPU's MMU and map memory-mapped I/O (MMIO) regions for use in kernel space, they typically keep the returned address of the mapping to themselves. Because each driver operates within its own context or namespace, exchanging these mappings between the nVidia drivers and other third-party vendor drivers that wish to support RDMA+GPUDirect would be very difficult and would result in a vendor-specific solution (possibly even product-specific, if drivers vary greatly between the third party's products). Also, today's operating systems do not have a good solution for exchanging MMIO mappings between drivers, so nVidia exports several functions that allow third-party drivers to easily access this information from within kernel space itself.
nVidia relies on "physical addressing" to access each card via RDMA for GPUDirect. This greatly simplifies moving data from one machine to a remote machine's PCI-Express bus using that machine's physical addressing scheme, without worrying about the problems associated with virtual addressing (for example, resolving virtual addresses to physical addresses). Each card has a physical address at which it resides and at which it can be accessed; only a small amount of logic needs to be added to a third-party driver attempting to perform RDMA operations. Also, these 32- or 64-bit Base Address Registers are part of the standard PCI configuration space, so the physical address of the card can easily be obtained simply by reading from its BAR, rather than obtaining the mapped address the nVidia driver acquired when it attached to the card. nVidia's Unified Virtual Addressing (UVA) takes care of mapping those physical addresses into a seemingly contiguous region of memory for user-space applications, for example:
[figure: the unified virtual address space, divided into CPU, GPU, and FREE regions]
These memory areas are further divided into three types: CPU, GPU, and FREE, all of which are documented here.
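If you want to see which of these regions a given address belongs to, the driver API can tell you. Below is a minimal user-space sketch (error handling abbreviated, device 0 assumed) that classifies a device buffer and a pinned host buffer with cuPointerGetAttribute(...) and CU_POINTER_ATTRIBUTE_MEMORY_TYPE; addresses CUDA knows nothing about (the FREE areas) simply make the call fail.

```c
/* Sketch: ask the CUDA driver which part of the unified VA space an
 * address belongs to. Error handling is abbreviated. */
#include <cuda.h>
#include <stdint.h>
#include <stdio.h>

static void classify(const char *name, CUdeviceptr ptr)
{
    CUmemorytype memtype;

    /* CU_POINTER_ATTRIBUTE_MEMORY_TYPE reports whether a UVA address
     * refers to host (CPU) or device (GPU) memory. */
    if (cuPointerGetAttribute(&memtype, CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                              ptr) != CUDA_SUCCESS) {
        printf("%s: not tracked by CUDA (free/unmapped)\n", name);
        return;
    }
    printf("%s: %s memory\n", name,
           memtype == CU_MEMORYTYPE_DEVICE ? "GPU" : "CPU");
}

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr dptr;
    void *hptr;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuMemAlloc(&dptr, 1 << 20);       /* device (GPU) buffer        */
    cuMemAllocHost(&hptr, 1 << 20);   /* pinned host (CPU) buffer   */

    classify("device buffer", dptr);
    classify("host buffer", (CUdeviceptr)(uintptr_t)hptr);

    cuMemFreeHost(hptr);
    cuMemFree(dptr);
    cuCtxDestroy(ctx);
    return 0;
}
```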
Back to your use case: since you are in user space, you do not have direct access to the system's physical address space, and the addresses you are using are most likely virtual addresses handed to you by nVidia's UVA. Assuming no previous allocations were made, your first allocation should land at offset +0x00000000, which is why you see the same offset reported for the GPU. If you allocated a second buffer, I suspect you would see it start right after the end of the first one (at offset +0x00100000 from the GPU's base virtual address, in your case of 1 MB allocations).
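Here is a short runtime-API sketch that makes this concrete: it allocates two 1 MB buffers and prints their addresses. Whether the second buffer really begins exactly 0x100000 bytes after the first depends on the driver's allocator, so treat the printed offset as illustrative rather than guaranteed.

```c
/* Sketch: allocate two 1 MB buffers and print their UVA addresses.
 * The offset between them depends on the driver's allocator. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    void *buf1 = NULL, *buf2 = NULL;
    const size_t size = 1 << 20;   /* 1 MB */

    cudaMalloc(&buf1, size);
    cudaMalloc(&buf2, size);

    printf("buf1 = %p\n", buf1);
    printf("buf2 = %p (offset from buf1: %td bytes)\n",
           buf2, (char *)buf2 - (char *)buf1);

    cudaFree(buf2);
    cudaFree(buf1);
    return 0;
}
```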
However, if you were in kernel space, writing a driver for your company's card to utilize RDMA for GPUDirect, you would use the 32- or 64-bit physical addresses assigned to the GPU by the system's BIOS and/or OS to RDMA data directly to and from the GPU itself.
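As a rough illustration of where those physical addresses come from, here is a minimal Linux kernel-module sketch that locates a card on the PCI bus and reads the physical base of its BAR0 via pci_resource_start(). Taking the first nVidia device that matches is purely for illustration; a real third-party driver would identify the exact devices it cares about.

```c
/* Kernel-space sketch: find an nVidia device and read the physical base
 * address of its BAR0 straight from PCI configuration space. */
#include <linux/module.h>
#include <linux/pci.h>

static int __init bar_demo_init(void)
{
    struct pci_dev *pdev;
    resource_size_t bar0_phys, bar0_len;

    /* Look up the first nVidia device without binding a driver to it. */
    pdev = pci_get_device(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, NULL);
    if (!pdev)
        return -ENODEV;

    bar0_phys = pci_resource_start(pdev, 0);   /* physical BAR0 base  */
    bar0_len  = pci_resource_len(pdev, 0);     /* BAR0 aperture size  */

    dev_info(&pdev->dev, "BAR0 phys=0x%llx len=0x%llx\n",
             (unsigned long long)bar0_phys,
             (unsigned long long)bar0_len);

    pci_dev_put(pdev);   /* drop the reference taken by pci_get_device */
    return 0;
}

static void __exit bar_demo_exit(void) { }

module_init(bar_demo_init);
module_exit(bar_demo_exit);
MODULE_LICENSE("GPL");
```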
Additionally, it may be worth noting that not all DMA engines actually support virtual addresses for transfers; in fact, most require physical addresses, since handling virtual addressing from within a DMA engine can get complicated (p. 7), and many DMA engines therefore lack support for it.
To answer the question in your post's title: nVidia currently only supports physical addressing for RDMA + GPUDirect in kernel space. For user-space applications, you will always be using the GPU's virtual address given to you by nVidia's UVA, which lives in the CPU's virtual address space.
In relation to your application, here is a simplified breakdown of the process you can follow for RDMA operations:
- Your user-space application creates buffers, which fall within the Unified Virtual Addressing space that nVidia provides (virtual addresses).
- Call cuPointerGetAttribute(...) to get P2P tokens; these tokens pertain to memory inside the CUDA context.
- Send all of this information to kernel space somehow (e.g. IOCTLs, reads/writes to your driver, etc.). At a minimum, you will want these three things to end up in your kernel-space driver:
  - The P2P token(s) returned by cuPointerGetAttribute(...)
  - The UVA virtual address(es) of the buffer(s)
  - The size(s) of the buffer(s)
- Now translate those virtual addresses into their corresponding physical addresses by calling the nVidia kernel-space functions, since these addresses are held in nVidia's page tables and can be accessed with the functions nVidia has exported, such as nvidia_p2p_get_pages(...), nvidia_p2p_put_pages(...), and nvidia_p2p_free_page_table(...).
- Use the physical addresses acquired in the previous step to initialize your DMA engine, which will be manipulating those buffers (see the sketches after this list).
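To make the walkthrough concrete, here is a hedged sketch of the user-space half (steps 1-3). It allocates a 1 MB buffer and pulls out the P2P tokens with cuPointerGetAttribute(...) and CU_POINTER_ATTRIBUTE_P2P_TOKENS; how the tokens, UVA address, and size actually reach your driver (the ioctl path) is not shown and is up to you.

```c
/* User-space sketch of steps 1-3: allocate a buffer in the UVA space
 * and retrieve its P2P tokens. Passing them to the driver is omitted. */
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr buf;
    CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tokens;
    const size_t size = 1 << 20;   /* 1 MB */

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&buf, size);

    /* The tokens identify this memory within the CUDA context. */
    if (cuPointerGetAttribute(&tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS,
                              buf) != CUDA_SUCCESS) {
        fprintf(stderr, "failed to get P2P tokens\n");
        return 1;
    }

    /* Hand tokens.p2pToken, tokens.vaSpaceToken, the UVA address 'buf'
     * and 'size' to your kernel driver here (e.g. via an ioctl). */
    printf("p2pToken=0x%llx vaSpaceToken=0x%x uva=0x%llx size=%zu\n",
           tokens.p2pToken, tokens.vaSpaceToken,
           (unsigned long long)buf, size);

    cuMemFree(buf);
    cuCtxDestroy(ctx);
    return 0;
}
```

And a condensed sketch of the kernel-space half (steps 4-5), assuming those three values arrived through your driver (for example, an ioctl handler). The structure and function declarations come from nv-p2p.h, which ships with the nVidia driver sources; error handling and the actual DMA programming are abbreviated.

```c
/* Kernel-space sketch of steps 4-5: translate a GPU UVA range into
 * physical pages and hand them to a DMA engine. */
#include <linux/kernel.h>
#include <linux/types.h>
#include "nv-p2p.h"   /* shipped with the nVidia driver sources */

static void free_callback(void *data)
{
    /* Invoked by the nVidia driver if the mapping is revoked; a real
     * driver would quiesce DMA and call nvidia_p2p_free_page_table(). */
}

static int map_gpu_buffer(u64 p2p_token, u32 va_space, u64 uva, u64 len)
{
    struct nvidia_p2p_page_table *page_table = NULL;
    u32 i;
    int ret;

    /* Translate the GPU UVA range into physical pages. */
    ret = nvidia_p2p_get_pages(p2p_token, va_space, uva, len,
                               &page_table, free_callback, NULL);
    if (ret)
        return ret;

    /* Each entry holds the physical address of one GPU page; the page
     * granularity is indicated by page_table->page_size. These are the
     * addresses you would program your DMA engine with. */
    for (i = 0; i < page_table->entries; i++)
        pr_info("GPU page %u at phys 0x%llx\n", i,
                (unsigned long long)page_table->pages[i]->physical_address);

    /* Release the mapping once DMA has completed. */
    nvidia_p2p_put_pages(p2p_token, va_space, uva, page_table);
    return 0;
}
```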
A more detailed explanation of this process can be found here.