RDMA transfer over InfiniBand: poor bandwidth

In my application I use the InfiniBand infrastructure to send a stream of data from one server to another. To ease development I have used IP over InfiniBand, because I am far more familiar with socket programming than with the InfiniBand verbs. So far performance (maximum bandwidth) has been good enough for me (I knew I was not getting the maximum achievable bandwidth); now, however, I need to get out of that IP-over-InfiniBand connection and increase the bandwidth.

ib_write_bw reports that my maximum achievable throughput is about 1500 MB/s (I do not get 3000 MB/s because my card is installed in a PCIe 2.0 x8 slot).

So far so good. I coded my own communication channel using ibverbs and RDMA, but I am getting far less than that bandwidth; in fact, I get slightly less bandwidth than with sockets, although at least my application does not use any CPU power:

ib_write_bw: 1500 MB/s

sockets (my current IP-over-InfiniBand implementation): 700 MB/s <= one core of my system is at 100% during this test

ibverbs + rdma: 600 MB/s <= no CPU is used at all during this test

The bottleneck seems to be here:

 // Describe the local buffer to send
 ibv_sge sge;
 sge.addr = (uintptr_t)memory_to_transfer;
 sge.length = memory_to_transfer_size;
 sge.lkey = memory_to_transfer_mr->lkey;

 // Build a signaled RDMA write towards the peer's registered region
 ibv_send_wr wr;
 memset(&wr, 0, sizeof(wr));
 wr.wr_id = 0;
 wr.opcode = IBV_WR_RDMA_WRITE;
 wr.sg_list = &sge;
 wr.num_sge = 1;
 wr.send_flags = IBV_SEND_SIGNALED;
 wr.wr.rdma.remote_addr = (uintptr_t)thePeerMemoryRegion.addr;
 wr.wr.rdma.rkey = thePeerMemoryRegion.rkey;

 ibv_send_wr *bad_wr = NULL;
 if (ibv_post_send(theCommunicationIdentifier->qp, &wr, &bad_wr) != 0) {
     notifyError("Unable to ibv_post_send");
 }

At this point, the following code waits for the completion:

 // Wait for completion
 ibv_cq *cq;
 void *cq_context;
 if (ibv_get_cq_event(theCompletionEventChannel, &cq, &cq_context) != 0) {
     notifyError("Unable to get a ibv cq event");
 }
 ibv_ack_cq_events(cq, 1);
 if (ibv_req_notify_cq(cq, 0) != 0) {
     notifyError("Unable to get a req notify");
 }
 ibv_wc wc;
 int myRet = ibv_poll_cq(cq, 1, &wc);
 if (myRet > 1) {
     LOG(WARNING) << "Got more than a single ibv_wc, expecting one";
 }
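For completeness, a more defensive variant of this wait would re-arm the notification and then drain the CQ in a loop, checking wc.status, since one event can cover several completions. This is only a sketch reusing the same names as above, not what I currently run:

 // Sketch: drain the CQ after each event instead of polling it once
 ibv_cq *cq;
 void *cq_context;
 if (ibv_get_cq_event(theCompletionEventChannel, &cq, &cq_context) != 0) {
     notifyError("Unable to get a ibv cq event");
 }
 ibv_ack_cq_events(cq, 1);
 if (ibv_req_notify_cq(cq, 0) != 0) {
     notifyError("Unable to request CQ notification");
 }
 ibv_wc wc;
 int n;
 while ((n = ibv_poll_cq(cq, 1, &wc)) > 0) {
     if (wc.status != IBV_WC_SUCCESS) {
         notifyError("Work completion finished with an error status");
     }
     // wc.wr_id identifies which work request completed
 }
 if (n < 0) {
     notifyError("ibv_poll_cq failed");
 }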

The time between my ibv_post_send and the moment ibv_get_cq_event returns the event is 13.3 ms when transferring 8 MB chunks, which works out to about 600 MB/s (8 MB / 13.3 ms ≈ 600 MB/s).

To be more specific, here is (in pseudo code) what I do overall:

Active side:

 post a message receive
 rdma connect
 wait for rdma connection event
 <<at this point the transfer tx flow starts>>
 start:
   register memory containing the bytes to transfer
   wait for remote memory region addr/key (I wait for an ibv_wc)
   send data with ibv_post_send
   post a message receive
   wait for the ibv_post_send completion (I wait for an ibv_wc) (this lasts 13.3 ms)
   send message "DONE"
   unregister memory
   goto start

Passive side:

 post a message receive
 rdma accept
 wait for rdma connection event
 <<at this point the transfer rx flow starts>>
 start:
   register memory that has to receive the bytes
   send addr/key of the registered memory
   wait for the "DONE" message
   unregister memory
   post a message receive
   goto start
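For concreteness, the "send addr/key" step could be carried in a small message along these lines; the struct and helper names are illustrative, not my actual code, and only the addr/rkey fields of ibv_mr and the wr.wr.rdma fields come from the verbs API:

 #include <cstdint>
 #include <infiniband/verbs.h>

 // Hypothetical wire format for the addr/key exchange; the real
 // application has its own message layout.
 struct RemoteMemoryInfo {
     uint64_t addr;  // start of the region registered on the passive side
     uint32_t rkey;  // remote key the active side puts into wr.wr.rdma.rkey
 };

 // Passive side: describe the region after ibv_reg_mr().
 inline void fillRemoteInfo(const ibv_mr *mr, RemoteMemoryInfo *out) {
     out->addr = (uint64_t)(uintptr_t)mr->addr;
     out->rkey = mr->rkey;
 }

 // Active side: copy the peer's values into the RDMA write work request.
 inline void applyRemoteInfo(const RemoteMemoryInfo *in, ibv_send_wr *wr) {
     wr->wr.rdma.remote_addr = in->addr;
     wr->wr.rdma.rkey = in->rkey;
 }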

Does anyone know what I am doing wrong? Or what can I improve? I am not affected by the "Not invented here" syndrome, so I am even willing to throw away what I have done so far and adopt something else. I only need a continuous point-to-point transfer.

2 answers

I solved the problem by allocating the buffers I transmit aligned to the page size. On my system the page size is 4K (the value returned by sysconf(_SC_PAGESIZE)). Doing so, while still doing the register/unregister of the memory each cycle, I now reach about 1400 MB/s.
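A minimal sketch of that allocation (the function name and the 8 MB size are illustrative), assuming the buffer is then handed to ibv_reg_mr():

 #include <stdlib.h>
 #include <unistd.h>
 #include <infiniband/verbs.h>

 // Allocate a buffer whose start address falls on a page boundary.
 void *allocPageAligned(size_t size) {
     void *buf = nullptr;
     size_t page = (size_t)sysconf(_SC_PAGESIZE);   // 4096 on my system
     if (posix_memalign(&buf, page, size) != 0) {
         return nullptr;
     }
     return buf;
 }

 // Usage, with pd being an ibv_pd obtained elsewhere:
 //   void *buf = allocPageAligned(8 * 1024 * 1024);
 //   ibv_mr *mr = ibv_reg_mr(pd, buf, 8 * 1024 * 1024,
 //                           IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);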


Based on your pseudo code, it looks like you register and unregister a memory region for every transfer. I think that is probably the main reason things are slow: memory registration is a fairly expensive operation, so you want to do it as little as possible and reuse the memory region as much as you can. All the time spent registering memory is time you are not spending transferring data.
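Something along these lines (the names are illustrative); the point is simply that ibv_reg_mr() runs once at setup and ibv_dereg_mr() once at teardown, not once per chunk:

 #include <cstddef>
 #include <infiniband/verbs.h>

 // Hypothetical per-connection context: the staging buffer is registered
 // once and the same ibv_mr is reused for every RDMA write.
 struct TransferContext {
     ibv_pd *pd;
     void *buffer;        // page-aligned staging buffer
     size_t bufferSize;
     ibv_mr *mr;          // registered once, reused for every transfer
 };

 bool setupOnce(TransferContext *ctx) {
     ctx->mr = ibv_reg_mr(ctx->pd, ctx->buffer, ctx->bufferSize,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
     return ctx->mr != nullptr;
 }

 void teardownOnce(TransferContext *ctx) {
     ibv_dereg_mr(ctx->mr);   // only when the connection is closed
 }

Per transfer, the sender then only places the next chunk into ctx->buffer and posts the write with ctx->mr->lkey.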

Your pseudo code also shows a second problem: you wait for completions synchronously and do not post another work request until the previous one has completed. That means that, from the time a work request finishes until you receive the completion and post the next request, the HCA sits idle. You are much better off keeping multiple send/receive work requests in flight, so that when the HCA finishes one work request it can move straight on to the next.
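One way to structure the send side for that, sketched with illustrative names: a helper posts a signaled RDMA write for chunk N and returns immediately, so the caller can keep a few writes outstanding (bounded by the QP's max_send_wr) and post the next chunk whenever it reaps a completion:

 #include <cstdint>
 #include <cstring>
 #include <infiniband/verbs.h>

 // Post a signaled RDMA write for one chunk of an already-registered
 // buffer and return without waiting; several of these can be in flight.
 int postWriteChunk(ibv_qp *qp, ibv_mr *mr, char *base, uint32_t chunkSize,
                    int chunkIndex, uint64_t remoteAddr, uint32_t rkey) {
     ibv_sge sge;
     sge.addr = (uintptr_t)base + (uint64_t)chunkIndex * chunkSize;
     sge.length = chunkSize;
     sge.lkey = mr->lkey;

     ibv_send_wr wr;
     memset(&wr, 0, sizeof(wr));
     wr.wr_id = chunkIndex;          // lets the completion identify the chunk
     wr.opcode = IBV_WR_RDMA_WRITE;
     wr.sg_list = &sge;
     wr.num_sge = 1;
     wr.send_flags = IBV_SEND_SIGNALED;
     wr.wr.rdma.remote_addr = remoteAddr + (uint64_t)chunkIndex * chunkSize;
     wr.wr.rdma.rkey = rkey;

     ibv_send_wr *bad_wr = nullptr;
     return ibv_post_send(qp, &wr, &bad_wr);   // 0 on success
 }

The driving loop primes the pipeline with a few such posts and then, each time it reaps a completion, posts the next chunk, so the send queue never drains while there is data left to send.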

