Client socket connections being refused on a Windows server when a small number (16 < x < 24) of clients attempt to connect simultaneously
We have run into a problem where incoming client socket connections to our socket server are refused when a relatively small number of nodes (16 to 24, but we will need to handle far more in the future) try to connect at the same time.
Some details:
- the server runs on Windows Server 2008 or Windows 7
- our main server is written in Java, using ServerSocket
- the clients run on grid nodes in our data center
When we run a test on the grid, the client nodes try to connect to the server, send a 40-100 KB packet, and then drop the connection. Somewhere between 16 and 24 nodes, we start to see client connections failing to reach the server. Given this setup, we appear to top out at 16-24 simultaneous client connections before failures begin, which seems far too low to us.
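A minimal sketch of what each node does per run, assuming a plain blocking Java Socket (the host, port, and payload size below are placeholders, not our real values):

    import java.io.OutputStream;
    import java.net.Socket;

    public class GridNodeClient {
        public static void main(String[] args) throws Exception {
            byte[] payload = new byte[64 * 1024];   // 40-100 KB in practice
            try (Socket s = new Socket("server-host", 9000)) {
                OutputStream out = s.getOutputStream();
                out.write(payload);                 // send the packet
                out.flush();
            }                                       // then close immediately
        }
    }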
The main server loop listens on a regular ServerSocket; when it accepts a connection, it spawns a new thread to handle that connection and immediately goes back to listening on the socket. We also have a dummy Python server that simply reads and discards incoming data, and a C++ server that logs the data before discarding it; both hit the same problem of clients failing to connect, with only minor variation in how many client connections succeed before the failures begin. This led us to believe that no particular server is at fault here and that the cause is probably environmental.
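A minimal sketch of that loop, with illustrative names (the real handler does more than just read):

    import java.io.InputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class MainServer {
        public static void main(String[] args) throws Exception {
            ServerSocket server = new ServerSocket(9000);     // default backlog is 50
            while (true) {
                Socket client = server.accept();
                new Thread(() -> handle(client)).start();     // back to accept() immediately
            }
        }

        private static void handle(Socket client) {
            try (Socket s = client; InputStream in = s.getInputStream()) {
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) {
                    // consume the 40-100 KB payload
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }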
Our first thought was to raise the TCP backlog on the listening socket. This did not mitigate the problem, even when the backlog was raised to very high values. The default for Java's ServerSocket is 50, far below the values we tried.
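For reference, the backlog is just the second argument to Java's ServerSocket constructor; a sketch with an arbitrary high value like the ones we tried:

    // Same listen socket as above, but with an explicit backlog of 2000.
    ServerSocket server = new ServerSocket(9000, 2000);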
We ran the test between machines on the same subnet and turned off all local firewalls on the machines, in case a firewall was rate-limiting connections to the server; no luck.
We tried tweaking the network configuration on the Windows machines (the registry locations involved are sketched after this list):
- Reducing TcpTimedWaitDelay, but with no effect (and it should not matter for my Python test, since that test only runs for a few milliseconds).
- Increasing MaxUserPort to a large value, around 65000, but with no effect (which is odd, since my Python test only ever sends 240 messages, so I should never come anywhere near this limit).
- Increasing TcpNumConnections to a large value (I do not remember the exact number). Again, we should never have more than 24 connections at a time, so this cannot be the limit.
- Enabling the "dynamic backlog" feature, which lets Windows grow the listen backlog on demand. I think we set the maximum to 2000 connections with a minimum of 1000, but it had no effect. Again, the Python test should never open more than 240 connections, so we should not even be triggering the dynamic backlog.
- On top of the above, disabling Windows "auto-tuning" for TCP. Again, no effect.
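For reference, a sketch of the registry locations behind the tweaks above; the value data shown is illustrative, not a recommendation, and the AFD dynamic backlog values may be ignored on newer Windows versions:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
        TcpTimedWaitDelay        (DWORD, seconds, e.g. 30)
        MaxUserPort              (DWORD, e.g. 65000)
        TcpNumConnections        (DWORD)

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters
        EnableDynamicBacklog         (DWORD, 1)
        MinimumDynamicBacklog        (DWORD, e.g. 1000)
        MaximumDynamicBacklog        (DWORD, e.g. 2000)
        DynamicBacklogGrowthDelta    (DWORD)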
It feels as though Windows is somehow capping the number of incoming connections, but we do not know what to change to allow more. The suggestion that something on the network is rate-limiting the connections also seems wrong to us, and we strongly doubt that this number of concurrent connections is overloading the physical gigabit network.
We are at a dead end. Has anyone else run into this kind of problem and found a solution?
I would check how many TCP connections are sitting in the TIME_WAIT state. I have seen this kind of problem caused by many connections being opened and closed in quick succession, leading to socket exhaustion via TIME_WAIT. To check, run:

    netstat -a

IIS is known to handle a very large number of concurrent incoming connections, far more than the limit you are hitting, which makes a hard environmental limit seem unlikely.
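To count them directly on Windows, you can filter the output, for example:

    netstat -an | find /c "TIME_WAIT"

A large and growing count here would point to TIME_WAIT exhaustion.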
If, as you say, increasing the TCP backlog does not improve the situation, then the problem may really lie in the behavior of accept(). You do not say whether the clients all receive the same error or a mix of errors. Timeouts would support this theory, while refusals would mean the backlog is not being drained fast enough.
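One way to tell those cases apart from the client side is to connect with an explicit timeout; a sketch assuming a plain Java client (host and port are placeholders):

    import java.net.ConnectException;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    public class ConnectProbe {
        public static void main(String[] args) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress("server-host", 9000), 5000);
                System.out.println("connected");
            } catch (ConnectException e) {
                // RST from the server side: backlog full or an OS limit
                System.out.println("refused: " + e.getMessage());
            } catch (SocketTimeoutException e) {
                // SYN silently dropped or rate-limited somewhere
                System.out.println("timed out: " + e.getMessage());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }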
Can you try prototyping the application as an ASPX host to understand the problem better?
Most likely you are being limited by the OS. Do you see event 4226 in your system logs?
Windows limits the number of parallel connection attempts to (I think) 10 connections per second, depending on the OS version (server versions allow up to 50).
To get around this, you have two options:
- Directly edit tcpip.sys in system32\drivers with a hex editor (just kidding :)
- Try editing the registry entry HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Lanmanserver\Parameters\MaxMpxCt (default = 10 commands).
You can also try this fix if you are using a version that does not allow you to set this parameter.
You can also play with other settings, such as the maximum number of TCBs used by the OS or the dynamic port allocation range, although those values should already be high enough for your needs.