Getting a disconnect notification using TCP Keep-Alive on a write-blocked socket

I use the TCP Keep-Alive option to detect dead connections. It works well on a connection that only reads from the socket:

 setsockopt(mysock, ...)   // set various keep-alive options
 epoll_ctl(ep, mysock, {EPOLLIN|EPOLLERR|EPOLLHUP}, )
 epoll_wait -> (exits after several seconds when the remote host's cable is disconnected)

epoll_wait exits with EPOLLIN | EPOLLHUP on the socket without any problem.

However, if I write a lot to the socket until I get EAGAIN and then poll for reading and writing, I do not get an error when the cable is disconnected:

 setsockopt(mysock, ...)   // set various keep-alive options
 while (send(...) != EAGAIN) ;   // pseudocode: write until send() fails with EAGAIN
 epoll_ctl(ep, mysock, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP}, )
 epoll_wait -> never exits, even when the remote host's cable is disconnected!

  • How can this be solved?
  • Has anyone seen a similar problem?
  • Any possible direction?

Edit: Additional Information

When I trace the connection with Wireshark, in the first case (reading) I see an ACK request once every few seconds. In the second case I do not see any at all.

+4
3 answers

If you pull the network connection before all the data has been transmitted, the connection is not idle, and so in some implementations the keepalive timer does not start. (Keep in mind that keepalive is NOT a mandatory part of the TCP specification, and as a result it is implemented inconsistently, if at all.) In general, because of the combination of exponential backoff and a large number of retries (tcp_retries2 is 15 by default), it can take up to 30 minutes for the transmission retries to time out before the keepalive timer starts.

The workaround, if one exists, depends on the specific TCP implementation you are using. Newer Linux kernels (2.6.37, released on January 4, 2011, and later) implement TCP_USER_TIMEOUT. More details here.
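As a rough sketch of the TCP_USER_TIMEOUT workaround on Linux 2.6.37 or later: the option takes a value in milliseconds and bounds how long transmitted data may stay unacknowledged before the kernel drops the connection. The helper name set_user_timeout is made up for illustration.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_USER_TIMEOUT (Linux 2.6.37+) */
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel to abort the connection when sent
 * data stays unacknowledged for more than timeout_ms milliseconds.
 * Returns 0 on success, -1 on error with errno set. */
int set_user_timeout(int fd, unsigned int timeout_ms)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof timeout_ms);
}
```

For example, set_user_timeout(mysock, 10000) before the write loop should let a socket with a full send queue fail with ETIMEDOUT after roughly ten seconds of unacknowledged data, instead of waiting for the retransmission limit.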

It is generally recommended to use application-level communication timeouts rather than relying on TCP keepalive. See, for example, HTTP Keep-Alive.
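One way an application-level timeout could be tracked is sketched below: record when data last arrived from the peer and declare the connection dead once the silence exceeds a limit. The struct and function names are hypothetical, and the protocol is assumed to carry some periodic traffic (for example a heartbeat message) that refreshes the timestamp.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-connection bookkeeping for an application-level timeout. */
struct peer_state {
    int64_t last_rx_ms;   /* monotonic time of the last data received */
    int64_t limit_ms;     /* silence longer than this counts as dead */
};

/* Current monotonic time in milliseconds. */
int64_t now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Call whenever recv() returns data, including heartbeat replies. */
void peer_note_rx(struct peer_state *p, int64_t t_ms)
{
    p->last_rx_ms = t_ms;
}

/* True if the peer has been silent for longer than the limit. */
bool peer_is_dead(const struct peer_state *p, int64_t t_ms)
{
    return t_ms - p->last_rx_ms > p->limit_ms;
}
```

The check would typically run each time the event loop wakes up, which is why the loop needs a finite wait timeout.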

+13

Even if you have already set the keepalive option on your application socket, you cannot detect a dead connection in a timely manner if your application keeps writing to the socket. This is because of TCP retransmission in the kernel's TCP stack. tcp_retries1 and tcp_retries2 are the kernel parameters that control the TCP retransmission timeout. It is difficult to predict the exact retransmission timeout because it is derived from the measured RTT. You can see this calculation in RFC 793 (section 3.7, Data Communication).

https://www.rfc-editor.org/rfc/rfc793.txt

Each platform has its own kernel parameters for TCP retransmission.

 Linux : tcp_retries1, tcp_retries2 : (exist in /proc/sys/net/ipv4) 

http://linux.die.net/man/7/tcp

 HPUX : tcp_ip_notify_interval, tcp_ip_abort_interval 

http://www.hpuxtips.es/?q=node/53

 AIX : rto_low, rto_high, rto_length, rto_limit 

http://www-903.ibm.com/kr/event/download/200804_324_swma/socket.pdf
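On Linux, the parameters listed above are plain text files under /proc/sys/net/ipv4, so a program can read them before deciding how long to wait. A minimal sketch, with a hypothetical helper name:

```c
#include <stdio.h>

/* Hypothetical helper: read a single integer from a procfs-style file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int read_int_file(const char *path, int *out)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Calling read_int_file("/proc/sys/net/ipv4/tcp_retries2", &value) would then give the system-wide retry limit.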

You should set a lower value for tcp_retries2 (the default is 15) if you want earlier detection of a dead connection, but as I said, the resulting timeout is not exact. In addition, you currently cannot set these values for a single socket; they are global kernel parameters. There have been attempts to add a per-socket TCP retransmission option ( http://patchwork.ozlabs.org/patch/55236/ ), but I do not think it was merged into the mainline kernel. I cannot find such an option defined in the system header files.
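To get a feel for how tcp_retries2 maps to wall-clock time, the sketch below sums an exponentially doubling retransmission timer. It assumes the Linux bounds of 200 ms (minimum RTO) and 120 s (maximum RTO); the real starting RTO depends on the measured RTT, so the result is only a lower-bound estimate. The function name is hypothetical.

```c
/* Assumed Linux constants: minimum RTO 200 ms, maximum RTO 120 s.
 * The real initial RTO is derived from the measured RTT, so this is an
 * estimate, not an exact figure. */
#define RTO_MIN_MS 200L
#define RTO_MAX_MS 120000L

/* Sum the wait after the original transmission and after each of
 * `retries` retransmissions, doubling the RTO up to the cap. */
long retrans_timeout_ms(int retries)
{
    long total = 0, rto = RTO_MIN_MS;
    for (int i = 0; i <= retries; i++) {
        total += rto;
        rto = (rto * 2 > RTO_MAX_MS) ? RTO_MAX_MS : rto * 2;
    }
    return total;
}
```

Under these assumptions retrans_timeout_ms(15) gives 924600 ms (about 15 minutes) and retrans_timeout_ms(8) gives 102200 ms, so lowering tcp_retries2 shortens detection considerably, at the cost of dropping connections over links that are merely slow.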

For reference, you can monitor the keepalive timer of your socket with 'netstat --timers', as shown below. https://stackoverflow.com/questions/34914278

 netstat -c --timer | grep "192.0.0.1:43245 192.0.68.1:49742"
 tcp 0 0 192.0.0.1:43245 192.0.68.1:49742 ESTABLISHED keepalive (1.92/0/0)
 tcp 0 0 192.0.0.1:43245 192.0.68.1:49742 ESTABLISHED keepalive (0.71/0/0)
 tcp 0 0 192.0.0.1:43245 192.0.68.1:49742 ESTABLISHED keepalive (9.46/0/1)
 tcp 0 0 192.0.0.1:43245 192.0.68.1:49742 ESTABLISHED keepalive (8.30/0/1)
 tcp 0 0 192.0.0.1:43245 192.0.68.1:49742 ESTABLISHED keepalive (7.14/0/1)
 tcp 0 0 192.0.0.1:43245 192.0.68.1:49742 ESTABLISHED keepalive (5.98/0/1)
 tcp 0 0 192.0.0.1:43245 192.0.68.1:49742 ESTABLISHED keepalive (4.82/0/1)

In addition, when the keepalive timeout occurs, the event returned differs by platform, so you should not judge a connection dead from the returned events alone. For example, HP-UX returns a POLLERR event and AIX returns a POLLIN event when the keepalive timeout occurs. At that point, the recv() call fails with ETIMEDOUT.
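Since the reported event varies, one platform-neutral approach is to ask the socket for its pending error whenever the poller wakes up, whatever event it reported. The sketch below uses the standard SO_ERROR socket option; the helper name is hypothetical.

```c
#include <sys/socket.h>

/* Hypothetical helper: fetch (and clear) the pending error on a socket.
 * Returns 0 if no error is pending, an errno value such as ETIMEDOUT if
 * one is, or -1 if getsockopt() itself fails. */
int socket_pending_error(int fd)
{
    int err = 0;
    socklen_t len = sizeof err;
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return -1;
    return err;
}
```

A result of ETIMEDOUT after a wakeup would indicate that the keepalive (or retransmission) timer gave up on the peer.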

In recent Linux kernels (2.6.37 and later) you can use the TCP_USER_TIMEOUT socket option, which works well. It can be set on a single socket.

+1

A few points that I would like to touch upon.

1) According to this document, here is what you need in order to use keepalive on Linux:

Linux has built-in keepalive support. You need to enable TCP/IP networking in order to use it. You also need procfs support and sysctl support to be able to configure the kernel parameters at runtime.

The procedures involving keepalive use three user-driven variables:

 tcp_keepalive_time 

> the interval between the last data packet sent (simple ACKs are not considered data) and the first keepalive probe; after the connection is marked as needing keepalive, this counter is not used any further

 tcp_keepalive_intvl 

> interval between consecutive keepalive probes, regardless of what the connection exchanged in the meantime

 tcp_keepalive_probes 

> the number of unacknowledged probes to send before considering the connection dead and notifying the application layer

Remember that keepalive support, even if it is configured in the kernel, is not the default behavior on Linux. Programs must request keepalive control for their sockets using the setsockopt interface. Relatively few programs implement keepalive, but you can easily add keepalive support to most of them by following the instructions explained later in this document.

Try looking at the current values of these variables on your system to make sure they are correct or make sense. The emphasis on requesting keepalive through setsockopt is mine, and it looks like you are already doing that.

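I assumed that the values of these variables were in milliseconds, but please double-check: the tcp(7) manual page gives tcp_keepalive_time and tcp_keepalive_intvl in seconds, and tcp_keepalive_probes is a count.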

 tcp_keepalive_time 

I would expect this value to mean something like "as soon as possible after the last data packet is sent, send the first probe".

 tcp_keepalive_intvl 

I assume the value of this variable should be smaller than the default time TCP takes to drop the connection.

 tcp_keepalive_probes 

This may be the "magic value" that makes or breaks your application; if the number of unacknowledged probes is too large, it may cause epoll_wait() to never return.

The document discusses the implementation of Linux keepalive in versions of the Linux kernel (2.4.x, 2.6.x), as well as how to write applications with TCP keepalive support in C.

http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/
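The three variables above are system-wide defaults. On Linux they can also be overridden per socket through setsockopt, which may be easier than tuning the whole machine. A minimal sketch, with a hypothetical helper name and example values chosen only for illustration:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT (Linux) */
#include <sys/socket.h>

/* Hypothetical helper: enable keepalive on one socket and override the
 * system-wide defaults. idle_s and intvl_s are in seconds; count is the
 * number of unacknowledged probes. Returns 0 on success, -1 on error. */
int enable_keepalive(int fd, int idle_s, int intvl_s, int count)
{
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof intvl_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count) < 0)
        return -1;
    return 0;
}
```

With enable_keepalive(mysock, 5, 1, 3), an idle connection to a dead peer should be reported after roughly 5 + 3 × 1 = 8 seconds. This only applies while the connection is idle; it does not help when unacknowledged data is queued, which is the case described in the question.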

2) Make sure you do not pass -1 as the timeout argument to epoll_wait(), because that causes epoll_wait() to block indefinitely.

 int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout); 

The timeout argument specifies the minimum number of milliseconds that epoll_wait() will block. (This interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount.) Specifying a timeout of -1 causes epoll_wait() to block indefinitely, while specifying a timeout of zero causes epoll_wait() to return immediately, even if no events are available.

From the manual page: http://linux.die.net/man/2/epoll_wait
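A small sketch of waiting with a finite timeout, so that the caller regains control periodically and can run its own liveness checks. Both helper names are hypothetical; the epoll calls themselves are the standard Linux ones.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical helper: register fd with the given event mask.
 * EPOLLERR and EPOLLHUP are always reported and need not be requested. */
int watch_fd(int epfd, int fd, uint32_t events)
{
    struct epoll_event ev = { .events = events, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical helper: wait for at most one event, giving up after
 * timeout_ms. Returns 1 if *ev was filled in, 0 on timeout, -1 on error.
 * A signal interruption restarts the wait with the full timeout. */
int wait_one_event(int epfd, struct epoll_event *ev, int timeout_ms)
{
    int n;
    do {
        n = epoll_wait(epfd, ev, 1, timeout_ms);
    } while (n < 0 && errno == EINTR);
    return n;
}
```

A return value of 0 means nothing happened within the interval, which is the point at which the application can decide for itself whether the peer has been silent for too long.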

0
