I have a really weird problem that is driving me crazy.
I have a Ruby server and a Flash client (ActionScript 3). This is a multiplayer game.
The problem is that everything works fine, and then suddenly a random player stops receiving data. When the server closes the connection due to inactivity, after about 20-60 seconds, the client receives all buffered data.
The client uses XMLSocket to receive data, so the way the client reads data should not be the problem.
    socket.addEventListener(Event.CONNECT, connectHandler);

    function connectHandler(event) {
        sendData(sess);
    }

    function sendData(dat) {
        trace("SEND: " + dat);
        addDebugData("SEND: " + dat);
        if (socket.connected) {
            socket.send(dat);
        } else {
            addDebugData("SOCKET NOT CONNECTED");
        }
    }

    socket.addEventListener(DataEvent.DATA, dataHandler);

    function dataHandler(e:DataEvent) {
        var data:String = e.data;
        workData(data);
    }
The server flushes after each write, so this is not a flushing problem:
    sock.write(data + DATAEOF)
    sock.flush()
DATAEOF is the null character, which XMLSocket uses as the message terminator, so the client knows where each string ends.
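To make the framing concrete, here is a minimal Ruby sketch of null-delimited message splitting. It is only a stand-in for what XMLSocket does on the client side (useful, for example, in a test client); the helper name `split_frames` is hypothetical, and the default delimiter is assumed to match DATAEOF.

```ruby
# Split a receive buffer into complete null-terminated messages.
# Returns [messages, remainder]; the remainder holds any partial
# trailing message that has not yet been terminated by the delimiter.
# `split_frames` is a hypothetical helper, not part of the original server.
def split_frames(buffer, delimiter = "\0")
  parts = buffer.split(delimiter, -1) # -1 keeps a trailing empty string
  remainder = parts.pop || ""         # last piece is incomplete (or empty)
  [parts, remainder]
end

messages, rest = split_frames("<a/>\0<b/>\0<c")
```

A reader loop would append each chunk it receives to `rest`, call `split_frames` again, and hand every complete message to the game logic.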
When the server accepts a new socket, it sets sync to true (autoflush) and sets TCP_NODELAY to true:
    newsock = serverSocket.accept
    newsock.sync = true
    newsock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, true)
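For anyone who wants to reproduce the setup in isolation, the following is a self-contained sketch of the accept-and-configure step plus one null-terminated write, run over the loopback interface. The helper names `configure_client` and `send_message` and the `<hello/>` payload are made up for illustration; the real server code is not shown in full above.

```ruby
require "socket"

# Apply the per-connection options described above to an accepted socket.
# (Hypothetical helper name.)
def configure_client(sock)
  sock.sync = true                                             # no Ruby-level buffering
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1) # disable Nagle's algorithm
  sock
end

# Write one message followed by the null terminator, then flush.
def send_message(sock, data, delimiter = "\0")
  sock.write(data + delimiter)
  sock.flush
end

server = TCPServer.new("127.0.0.1", 0) # port 0: let the OS pick a free port
port = server.addr[1]
client = TCPSocket.new("127.0.0.1", port)
conn = configure_client(server.accept)

send_message(conn, "<hello/>")
received = client.gets("\0") # reads up to and including the delimiter

# Read the option back to confirm it was applied.
nodelay = conn.getsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY).int
[client, conn, server].each(&:close)
```

On a healthy loopback connection the message arrives immediately, which is a useful baseline when comparing against the stalled connections described in this question.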
This is my research:
Information: every day I wrote netstat data to a file.
- When the client stops receiving data, netstat shows that the socket state remains ESTABLISHED.
- After a few seconds, the send queue grows according to the data sent.
- tcpflow shows that packets are sent 2 times.
- When the server closes the socket, the socket state changes to FIN_WAIT1, as expected. Then tcpflow shows that all buffered data is sent to the client, but the client does not receive it. After a few seconds, the connection disappears from netstat and tcpflow shows that the same data is sent again, but this time the client receives it, so it starts sending data to the server, and the server receives it. But it's too late... the server has already closed the connection.
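The send-queue observation above can be automated. The sketch below parses Linux `netstat -tn` output and keeps only the connections with a non-empty Send-Q. The helper names and the sample addresses are invented for illustration (they are not taken from the real logs), and the column layout is assumed to be the usual Linux one: Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State.

```ruby
# Parse one line of Linux `netstat -tn` output into a hash.
# Returns nil for header lines and anything that is not a TCP entry.
# (Hypothetical helper; column layout assumed as described above.)
def parse_netstat_line(line)
  fields = line.split
  return nil unless fields.size >= 6 && fields[0].start_with?("tcp")
  {
    recv_q: Integer(fields[1]),
    send_q: Integer(fields[2]),
    local:  fields[3],
    remote: fields[4],
    state:  fields[5]
  }
end

# Keep only connections whose send queue is non-empty.
def stalled_connections(netstat_output)
  netstat_output.each_line
                .map { |line| parse_netstat_line(line) }
                .compact
                .select { |conn| conn[:send_q] > 0 }
end

# Made-up sample output, for illustration only.
sample = <<~OUT
  Proto Recv-Q Send-Q Local Address           Foreign Address         State
  tcp        0      0 10.0.0.5:5000           203.0.113.7:51234       ESTABLISHED
  tcp        0   4312 10.0.0.5:5000           198.51.100.9:49822      ESTABLISHED
OUT
stalled = stalled_connections(sample)
```

Logging the result periodically makes it easy to see which remote address has a growing queue while its state still reads ESTABLISHED.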
I don’t think this is an OS / network problem, because I moved from a VPS located in Spain to Amazon EC2 located in Ireland and the problem still remains.
I also don’t think that this is a client network problem, because it happens dozens of times a day, and the average number of online users is about 45-55, with about 400 unique users per day, so the ratio is extremely high.
EDIT: I did more research. I changed the server to C++.
When the client stops sending data, after a while the server receives the message "Connection reset by peer". At that moment, tcpdump shows that the client sent an RST packet. This could be because the client closed the connection while the server was still trying to read, but why would the client close the connection? I think the answer is that it is not the client that closes the connection, it is the kernel. Here is some information: http://scie.nti.st/2008/3/14/amazon-s3-and-connection-reset-by-peer
Basically, as I understand it, Linux kernels 2.6.17+ increased the maximum size of the TCP window/buffer, and this started to cause other gear to wig out, if it couldn't handle sufficiently large TCP windows. The gear would reset the connection, and we see this as a "Connection reset by peer" message.
I have followed the steps from that article, and now it seems that the server closes connections only when the client loses its Internet connection.
I am going to add this as an answer so that people know a little about it.