UDP sendto loopback performance

Background

I have a very high bandwidth / low latency network application (the target is <5 usec per packet), and I wanted to add some monitoring / metrics to it. I have heard good things about statsd, and it seems an easy way to collect metrics and feed them into our time-series database. Metrics are sent as small UDP packets to the daemon (which usually runs on the same server).

I wanted to characterize the effect of sending ~5-10 UDP packets in my data path, to understand how much latency it would add, and was surprised at how bad it is. I know this is a very obscure micro-benchmark, but I just wanted to get a rough idea of where it lands.

My question

I am trying to understand why it takes so much longer (relatively speaking) to send a UDP packet to localhost than to a remote host. Are there any tricks I can use to reduce the latency of sending a UDP packet? I am starting to think the solution for me is to push the metric collection onto a separate core, or to actually run the statsd daemon on a separate host.


My setup / tests

CentOS 6.5 on server-class hardware.
The client testing program I used is available here: https://gist.github.com/rishid/9178261
Compiled with gcc 4.7.3: gcc -O3 -std=gnu99 -mtune=native udp_send_bm.c -lrt -o udp_send_bm
The receiver side is running nc -ulk 127.0.0.1 12000 > /dev/null (change the IP for each interface)

I ran this micro-benchmark on the following devices.
Some test results:

  • loopback
    • Packet size 500 // Time in sendto() 2159 ns // Total time 2.159518
  • integrated 1 Gb controller
    • Packet size 500 // Time in sendto() 397 ns // Total time 0.397234
  • intel ixgbe 10 Gb
    • Packet size 500 // Time in sendto() 449 ns // Total time 0.449355
  • solarflare 10 Gb with userspace stack (onload)
    • Packet size 500 // Time in sendto() 317 ns // Total time 0.317229
performance c linux udp sockets
2 answers

Writing to the loopback is not going to be an efficient way to communicate between processes for profiling. Typically the buffer will be copied several times before it is processed, and you run the risk of dropping packets since you are using UDP. You are also making extra system calls, so you add the risk of context switching (~2 usec).

Goal: <5 usec per packet

Is that a hard real-time requirement, or a soft one? Generally, when you are handling things in microseconds, profiling needs to be zero overhead. You are using Solarflare?, so I think you are serious. The best way I know of is to tap the physical line and sniff the traffic for metrics. A number of products do this.


I/O to disk or network is very slow if you include it in a very tight (real-time) processing loop. One solution may be to offload the I/O to a separate task with lower priority. Let the real-time loop pass messages to the I/O task through a (non-blocking) queue.

