Thrift TSimpleServer becomes unresponsive after several successful requests

I have a Thrift API served by a Java application running on Linux. I use a .NET client to connect to the API and perform operations.

The first few calls to the service work fine without errors, but then (seemingly at random) a call hangs. If I kill my client and try to connect again, either the service hangs again or my client fails with the following error:

    Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
       at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
       at Thrift.Transport.TStreamTransport.Read(Byte[] buf, Int32 off, Int32 len)
       (etc.)

When I use JConsole to get a thread dump, the server is sitting in accept():

    "Thread-1" prio=10 tid=0x00002aaad457a800 nid=0x79c7 runnable [0x00000000434af000]
       java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
        - locked <0x00000005c0fef470> (a java.net.SocksSocketImpl)
        at java.net.ServerSocket.implAccept(ServerSocket.java:462)
        at java.net.ServerSocket.accept(ServerSocket.java:430)
        at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:113)
        at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
        at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
        at org.apache.thrift.server.TSimpleServer.serve(TSimpleServer.java:63)

netstat on the server shows connections to the service port in the TIME_WAIT state, which eventually disappear a few minutes after I kill the client (as expected).

The code that sets up the Thrift service is as follows:

    int port = thriftServicePort;
    String host = thriftServiceHost;
    InetAddress adr = InetAddress.getByName(host);
    InetSocketAddress address = new InetSocketAddress(adr, port);
    TServerTransport serverTransport = new TServerSocket(address);
    TServer server = new TSimpleServer(
            new TServer.Args(serverTransport).processor((org.apache.thrift.TProcessor) processor));
    server.serve();

Note that we use the TServerSocket constructor that accepts an explicit host name or IP address. I suspect that I should switch to the constructor that takes only a port (which ends up binding to InetAddress.anyLocalAddress()). Alternatively, I suppose I could configure the service to bind to the "wildcard" address ("0.0.0.0").

I should mention that the service is not hosted on the open Internet. It is hosted on a private network, and I use SSH tunneling to reach it. Therefore, the host name to which the service is bound does not resolve on my local network (although I can make the initial connection through the tunnel). I wonder whether this is similar to the Java RMI TCP callback problem?

Is there a technical explanation for what is happening (if this is a common problem) or additional troubleshooting steps that I can take?

UPDATE

Today we had the same problem, but this time jstack showed that the Thrift server was blocked indefinitely reading from the input stream:

    "Thread-1" prio=10 tid=0x00002aaad43fc000 nid=0x60b3 runnable [0x0000000041741000]
       java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:22)
        at org.apache.thrift.server.TSimpleServer.serve(TSimpleServer.java:70)

So we need to set the "client timeout" in the TServerSocket constructor. But why did the application also refuse connections when it was blocked in accept()?

+7
4 answers

From your stack trace, it seems you're using TSimpleServer, whose Javadoc says:

Simple single-threaded server for testing.

You probably want to use TThreadPoolServer instead.

Most likely, TSimpleServer's only thread is blocked, waiting for a dead client to respond or time out. And since TSimpleServer is single-threaded, no thread is available to handle other requests.
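The blocking read described above can be reproduced with plain java.net sockets, without Thrift. The sketch below is only an illustration: the class and method names are made up, and it assumes the "client timeout" mentioned in the question's update behaves like an ordinary socket read timeout. A client connects and stays silent; with a read timeout set, the serving thread gets control back instead of waiting forever.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; it uses only java.net, not Thrift.
class SilentClientDemo {

    // Accepts one client that never sends anything and tries to read the
    // first byte of its "request". Returns true if the read gave up with a
    // timeout, i.e. the serving thread got control back.
    // timeoutMillis must be greater than 0 (0 means "wait forever").
    static boolean readTimesOut(int timeoutMillis) {
        try (ServerSocket server = new ServerSocket(0); // port 0: any free port
             Socket silentClient = new Socket(InetAddress.getLoopbackAddress(),
                                              server.getLocalPort());
             Socket accepted = server.accept()) {
            // Without this line the read below would block indefinitely,
            // and a single-threaded server could not accept anyone else.
            accepted.setSoTimeout(timeoutMillis);
            try {
                accepted.getInputStream().read();
                return false; // the client sent data or closed the connection
            } catch (SocketTimeoutException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

A timeout only bounds how long one stalled client can hold the thread. A multi-threaded server is still needed if other clients must be served in the meantime.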

+4

I have some suggestions. You mentioned that the first few calls to the server work and that it hangs afterwards. This is the key. One scenario in which this happens is when the client does not send all of its bytes to the server. I am not familiar with TSimpleServer, but I assume that it listens on a port, speaks some binary protocol, and expects every client to talk to it in that protocol. Your .NET client talks to this server by sending bytes. If it does not flush its output buffer properly, it may not send all of the bytes to the server, which leaves the server hanging.

In Java, this can happen on the client side, for example:

    // Get the socket stream to write to
    BufferedOutputStream stream = new BufferedOutputStream(socket.getOutputStream());
    stream.write(content); // write everything that needs to be written
    stream.flush();        // without flush(), the server may receive an incomplete message and hang
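The effect of a missing flush() can be checked without a network. In this small sketch a ByteArrayOutputStream stands in for the socket's output stream (an assumption made purely for illustration; the class and method names are hypothetical), and the method reports how many bytes have reached it before and after flush().

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of Thrift.
class FlushDemo {

    // Writes a message through a BufferedOutputStream and returns
    // {bytes that reached the sink before flush(), bytes after flush()}.
    // The message is assumed to be smaller than the stream's internal
    // buffer (8192 bytes by default), so nothing is forwarded early.
    static int[] bytesDelivered(byte[] message) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the socket
        BufferedOutputStream out = new BufferedOutputStream(sink);
        try {
            out.write(message);
            int before = sink.size(); // still held in the buffer
            out.flush();
            int after = sink.size();  // now handed to the sink
            return new int[] {before, after};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = bytesDelivered(new byte[] {1, 2, 3, 4});
        System.out.println("before flush: " + counts[0] + ", after flush: " + counts[1]);
    }
}
```

For a 4-byte message, nothing reaches the sink until flush() is called. On a real socket, the peer would be left waiting for those bytes.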

Suggestions:

a) Go through the .NET client code. Make sure that every part of the code that actually communicates with the server calls flush() or the equivalent cleanup methods. Note: from the Thrift documentation, I saw that the transport layer defines a flush() method. You should scan your .NET code and see how it uses the transport methods. http://thrift.apache.org/docs/concepts/

b) For further debugging, you can try writing a small Java client that mimics your .net client. Run the java client on your Linux machine (the same machine where TSimpleServer is running). See if this causes the same problem. If so, you can debug your java client and find the root cause. If this is not the case, you can run it where your .net client works, and see if there are any problems and take it from there.

Edit: c) I found an example of Thrift client code in Java here: https://chamibuddhika.wordpress.com/2011/10/02/apache-thrift-quickstart-tutorial/ I noticed that it calls transport.open(), does its work, and then calls transport.close(). As suggested in a), you could go through your .NET client code and check that it calls the transport's flush() and close() methods at the end.

+3

Binding the Thrift service to the wildcard address ("0.0.0.0") solved the problem; the service no longer hangs.
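For readers unsure what the difference in binding amounts to, here is a sketch with plain java.net sockets rather than Thrift's TServerSocket (the class and method names are hypothetical). It only shows how a listener bound to one specific address differs from one bound to the wildcard address. It does not explain why the change fixed the hang.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical demo class; it uses only java.net, not Thrift.
class BindAddressDemo {

    // Binds a listener to the given address and reports whether it ended
    // up on the wildcard address, i.e. reachable through every local interface.
    static boolean listensOnAllInterfaces(InetSocketAddress bindAddress) {
        try (ServerSocket server = new ServerSocket()) {
            server.bind(bindAddress);
            return server.getInetAddress().isAnyLocalAddress();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Port 0 asks the operating system for any free port.
        InetSocketAddress wildcard = new InetSocketAddress(0);
        InetSocketAddress loopbackOnly =
                new InetSocketAddress(InetAddress.getLoopbackAddress(), 0);

        System.out.println("port only:        " + listensOnAllInterfaces(wildcard));
        System.out.println("explicit address: " + listensOnAllInterfaces(loopbackOnly));
    }
}
```

An InetSocketAddress built from a port alone uses the wildcard address, while one built from a host or IP restricts the listener to that single address. The loopback address here is just a convenient stand-in for the explicit host in the question.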

Using a multi-threaded server would make the application more responsive, but it would still lead to hung or incomplete requests.

If someone stumbles on this issue and can provide a more complete explanation, including how it relates to the Java RMI TCP callback problem (which I linked in my question), I will gladly accept that answer.

0

I have a similar problem in a C++ server/client environment.

The C++ client calls a method (attributeDefinitionsAliases) and waits for a response.

The C++ server starts writing to the socket but then blocks. Wireshark capture:

[Image: Wireshark capture]

After closing the C++ client, the following exception is thrown on the C++ server:

Thrift internal message: TSocket::write_partial() send(): errno = 10054

Thrift internal message: TConnectedClient died: write() send(): errno = 10054

UPDATE 1: This is not a Thrift issue. The problem seems to be how the server is launched. I have an application (launcher-app) that starts the server with QProcess ( https://doc.qt.io/archives/qt-4.8/qprocess.html ). Starting it with popen works fine.

0
