Unix Domain Latency Measurement

I want to compare the latency of Unix domain sockets between two processes against other IPC mechanisms.

I have a basic program that creates a pair of sockets and then calls fork(). It then measures the round-trip time of sending 8192 bytes from one process to the other and back (the buffer contents differ on each iteration).

 #include <assert.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <time.h>
 #include <sys/time.h>
 #include <sys/types.h>
 #include <sys/socket.h>
 #include <unistd.h>

 int main(int argc, char **argv)
 {
     int i, pid, sockpair[2];
     char buf[8192];
     struct timespec tp1, tp2;

     assert(argc == 2);

     // Create a socket pair using Unix domain sockets with reliable,
     // in-order data transmission.
     assert(socketpair(AF_UNIX, SOCK_STREAM, 0, sockpair) == 0);

     // Fork to create a child process, then start the benchmark.
     pid = fork();
     if (pid == 0) {
         // Child: echo every block back to the parent.
         for (i = 0; i < atoi(argv[1]); i++) {
             assert(recv(sockpair[1], buf, sizeof(buf), 0) > 0);
             assert(send(sockpair[1], buf, sizeof(buf), 0) > 0);
         }
     } else {
         // Parent: time one round trip per iteration.
         for (i = 0; i < atoi(argv[1]); i++) {
             memset(buf, i, sizeof(buf));
             buf[sizeof(buf) - 1] = '\0';
             assert(clock_gettime(CLOCK_REALTIME, &tp1) == 0);
             assert(send(sockpair[0], buf, sizeof(buf), 0) > 0);
             assert(recv(sockpair[0], buf, sizeof(buf), 0) > 0);
             assert(clock_gettime(CLOCK_REALTIME, &tp2) == 0);
             // Use both fields so the result stays correct when
             // tv_nsec wraps across a second boundary.
             printf("%ld ns\n",
                    (tp2.tv_sec - tp1.tv_sec) * 1000000000L
                    + (tp2.tv_nsec - tp1.tv_nsec));
         }
     }
     return 0;
 }

However, I noticed that on every run, the elapsed time of the first iteration (i = 0) is always an outlier:

 79306 ns
 18649 ns
 19910 ns
 19601 ns
 ...

I wonder whether the kernel performs some lazy setup on the first send() call, for example allocating 8192 bytes of kernel memory to buffer the data between the calls to send() and recv()?

+6
3 answers

It is not the copying of the data that takes the extra ~80 microseconds; that would be extremely slow (only about 100 MB/s). The real cause is that you are using two processes: when the parent sends data for the first time, that data has to wait until the child has finished fork() and actually started executing.

If you absolutely want to use two processes, you should first send in the other direction, so that the parent waits until the child is ready before it starts sending.

For example:

Child:

  send(); recv(); send(); 

Parent:

  recv(); gettime(); send(); recv(); gettime(); 

You also need to understand that your test depends heavily on whether the two processes are placed on different CPU cores; when they run on the same core, every round trip forces a task switch.

For this reason, I highly recommend taking the measurement with a single process. Even without poll() or anything similar, you can do this, provided you send reasonably small blocks that fit into the socket buffers:

 gettime(); send(); recv(); gettime(); 

Do one untimed warm-up round trip first, to ensure the buffers are allocated. I am pretty sure you will measure much lower times this way.

0

I would guess that instruction cache misses in the kernel code involved are a significant part of the first-time slowdown, probably together with data cache misses on the kernel data structures that track the socket.

There may also be some lazy initialization.

You can test this by doing a sleep(10) between trials (including before the first one), or by doing something that churns through the entire CPU cache between trials, for example loading a web page. If it is lazy initialization, only the first call will be slow. If it is the caches, every call will be equally slow whenever they are cold.

+1

In the Linux kernel you can find the ___sys_sendmsg function, which send() ends up calling; you can look the code up in the kernel source tree.

The function has to copy the user's message (in your case the 8 KB buf) from user space into kernel space. Afterwards, recv() copies the message from kernel space into the user space of the child process.

This means each send()/recv() pair costs two memcpy operations and one kmalloc.

The first pair is special because the kernel memory that will hold the message has not been allocated yet, and is therefore also absent from the data cache. So the first send()/recv() pair allocates the kernel memory where buf will be stored and warms the caches; subsequent calls reuse that memory (note the used_address argument in the function prototype).

So your assumption is correct: the first run allocates 8 KB in the kernel and runs with cold caches, while the rest reuse the previously allocated and cached memory.

+1
