In my application I use an InfiniBand infrastructure to send a stream of data from one server to another. To ease development I have been using IP over InfiniBand, because I am more familiar with socket programming. Until now the performance (max bandwidth) was good enough for me (I knew I was not getting the maximum achievable bandwidth); now I need to get more bandwidth out of that InfiniBand connection.
ib_write_bw claims that my maximum achievable bandwidth is about 1500 MB/s (I do not get 3000 MB/s because my card is installed in a PCIe 2.0 x8 slot).
So far so good. I coded my communication channel using ibverbs and RDMA, but I am getting far less than the bandwidth I could get; I even get a bit less bandwidth than with sockets, but at least my application does not use any CPU:
ib_write_bw: 1500 MB/s
sockets (IP over InfiniBand): 700 MB/s <= one core of my system is at 100% during this test
ibverbs + RDMA: 600 MB/s <= no CPU is used at all during this test
The bottleneck seems to be here:
    ibv_sge sge;
    sge.addr = (uintptr_t)memory_to_transfer;
    sge.length = memory_to_transfer_size;
    sge.lkey = memory_to_transfer_mr->lkey;

    ibv_send_wr wr;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id = 0;
    wr.opcode = IBV_WR_RDMA_WRITE;          // one-sided RDMA write to the peer
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;      // generate a completion when the write is done
    wr.wr.rdma.remote_addr = (uintptr_t)thePeerMemoryRegion.addr;
    wr.wr.rdma.rkey = thePeerMemoryRegion.rkey;

    ibv_send_wr *bad_wr = NULL;
    if (ibv_post_send(theCommunicationIdentifier->qp, &wr, &bad_wr) != 0) {
        notifyError("Unable to ibv post send");
    }
At that point, this code waits for the completion:
    // Wait for completion
    ibv_cq *cq;
    void *cq_context;
    if (ibv_get_cq_event(theCompletionEventChannel, &cq, &cq_context) != 0) {
        notifyError("Unable to get an ibv cq event");
    }
    ibv_ack_cq_events(cq, 1);
    if (ibv_req_notify_cq(cq, 0) != 0) {
        notifyError("Unable to request a cq notification");
    }
    ibv_wc wc;
    int myRet = ibv_poll_cq(cq, 1, &wc);
    if (myRet > 1) {
        LOG(WARNING) << "Got more than a single ibv_wc, expecting one";
    }
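For context, the event-driven pattern I am trying to follow is roughly the one below (a minimal sketch, not my exact code: waitForCompletion is a hypothetical wrapper, error handling is reduced to boolean returns, and it drains the CQ in a loop since a single event can cover more than one completion):

    #include <infiniband/verbs.h>

    // Hypothetical wrapper around the wait shown above. It blocks on the
    // completion channel, re-arms the CQ and then drains it.
    static bool waitForCompletion(ibv_comp_channel *channel)
    {
        ibv_cq *cq = nullptr;
        void   *cq_context = nullptr;

        if (ibv_get_cq_event(channel, &cq, &cq_context) != 0)
            return false;                    // failed to read the event
        ibv_ack_cq_events(cq, 1);            // acknowledge it

        if (ibv_req_notify_cq(cq, 0) != 0)   // re-arm before draining
            return false;

        ibv_wc wc;
        int n;
        while ((n = ibv_poll_cq(cq, 1, &wc)) > 0) {
            if (wc.status != IBV_WC_SUCCESS)
                return false;                // the work request failed
        }
        return n == 0;                       // n < 0 would be a polling error
    }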
The time from my ibv_post_send until ibv_get_cq_event returns an event is 13.3 ms when transferring chunks of 8 MB, which works out to about 600 MB/s (8 MB / 13.3 ms).
To give more detail, this is what I do globally (in pseudocode):
Active side:
    post a message receive
    rdma connect
    wait for rdma connection event
    <<at this point the transfer tx flow starts>>
    start:
        register memory containing the bytes to transfer
        wait for the remote memory region addr/key (I wait for an ibv_wc)
        send data with ibv_post_send
        post a message receive
        wait for the ibv_post_send completion (I wait for an ibv_wc) (this lasts 13.3 ms)
        send message "DONE"
        unregister memory
        goto start
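As a sketch in real code, one iteration of that transmit loop looks roughly like this (thePd, chunk, chunk_size and the postMessageReceive / waitForPeerRegion / postRdmaWrite / sendDoneMessage helpers are placeholders for my own code around the verbs calls shown earlier, not exact code):

    // One iteration of the active-side loop, assuming the RDMA connection is
    // already established. All helper names are placeholders.
    for (;;) {
        // Register the buffer holding this chunk (8 MB in my tests).
        ibv_mr *mr = ibv_reg_mr(thePd, chunk, chunk_size, IBV_ACCESS_LOCAL_WRITE);
        if (mr == nullptr) { notifyError("Unable to register memory"); break; }

        // The passive side tells me where to write (addr + rkey of its buffer).
        thePeerMemoryRegion = waitForPeerRegion();        // waits for an ibv_wc

        // Single signaled RDMA write for the whole chunk (the code shown above).
        postRdmaWrite(theCommunicationIdentifier->qp, mr, chunk, chunk_size,
                      thePeerMemoryRegion);

        postMessageReceive();                             // for the next addr/rkey
        waitForCompletion(theCompletionEventChannel);     // this is the 13.3 ms wait

        sendDoneMessage();                                // tell the peer the chunk landed
        ibv_dereg_mr(mr);                                 // unregister and start over
    }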
Passive side:
    post a message receive
    rdma accept
    wait for rdma connection event
    <<at this point the transfer rx flow starts>>
    start:
        register memory that has to receive the bytes
        send addr/key of the registered memory
        wait for the "DONE" message
        unregister memory
        post a message receive
        goto start
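And the matching receive loop on the passive side, again with placeholder names (rx_buffer, rx_buffer_size, sendLocalRegion, waitForDoneMessage, postMessageReceive):

    // One iteration of the passive-side loop. The receive buffer must be
    // registered with remote write access so the peer's RDMA write can land in it.
    for (;;) {
        ibv_mr *mr = ibv_reg_mr(thePd, rx_buffer, rx_buffer_size,
                                IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
        if (mr == nullptr) { notifyError("Unable to register memory"); break; }

        // Send addr/rkey of the registered region to the active side.
        sendLocalRegion(rx_buffer, mr->rkey);

        waitForDoneMessage();        // the active side signals the chunk is written
        ibv_dereg_mr(mr);            // unregister
        postMessageReceive();        // be ready for the next "DONE"
    }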
Does anyone know what I am doing wrong? Or what can I improve? I am not affected by "Not Invented Here" syndrome, so I am even ready to throw away what I have done so far and adopt something else. I only need a continuous point-to-point transfer.