Kafka producer TimeoutExceptions and NetworkExceptions

We get random NetworkExceptions and TimeoutExceptions in our production environment:

Brokers: 3
Zookeepers: 3
Servers: 3
Kafka: 0.10.0.1
Zookeeper: 3.4.3

We sometimes get this exception in our producer logs:

Expiring 10 record(s) for TOPIC: XXXXXX: 5608 ms has passed since batch creation plus linger time

The number of milliseconds in these error messages keeps changing. Sometimes it is around 5 seconds; in other cases it is up to around 13 seconds.

And very rarely do we get:

 NetworkException: The server disconnected before a response was received. 

The cluster consists of 3 brokers and 3 Zookeepers. The producer servers and the Kafka cluster are on the same network.

I am making synchronous calls. A web service receives multiple user requests, each of which sends its data to Kafka. The web service holds a single producer object that performs all the sends. The producer's request timeout was originally 1000 ms and was later increased to 15000 ms (15 seconds). Even after increasing the timeout, TimeoutExceptions still show up in the error logs.
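For context, here is a minimal sketch of the kind of setup described above. The class, topic, and broker names are illustrative and not taken from the actual service:

import java.util.Properties;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SyncSender {
    // Single producer instance shared by all web requests, as described above.
    private static final Producer<String, String> PRODUCER = createProducer();

    private static Producer<String, String> createProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("request.timeout.ms", "15000"); // raised from the original 1000 ms
        return new KafkaProducer<>(props);
    }

    // Synchronous send: blocking on the returned Future is what surfaces
    // TimeoutException / NetworkException to the calling request.
    public static RecordMetadata send(String topic, String key, String value) throws Exception {
        return PRODUCER.send(new ProducerRecord<>(topic, key, value))
                       .get(30, TimeUnit.SECONDS);
    }
}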

What could be the reason?

+13
java apache-kafka
4 answers

It is a little hard to pinpoint the root cause, so I will share my experience and hope someone finds it useful. In general, it may be a network problem or too much network flooding in combination with acks=all. Here is a diagram that explains the TimeoutException, taken from Kafka KIP-91 at the time of writing (still applicable up to 1.1.0):

[KIP-91 diagram illustrating the producer send path and where the timeouts apply]

Excluding network configuration problems or errors, these are the properties you can tune, depending on your scenario, to mitigate or solve the problem:

  • buffer.memory controls the total amount of memory available to the producer for buffering. If records are produced faster than they can be delivered to Kafka, this buffer fills up, and further send calls block for up to max.block.ms, after which the producer throws a TimeoutException.

  • max.block.ms already has a high default value, and I do not suggest increasing it. buffer.memory defaults to 32 MB; depending on your message size, you may want to increase it (and, if necessary, increase the JVM heap accordingly).

  • retries defines how many times the producer resends a record after a transient error before failing. If you are currently using zero retries, you can try to alleviate the problem by increasing this value, but be aware that record ordering is no longer guaranteed unless you set max.in.flight.requests.per.connection to 1.

  • Records are sent as soon as either the batch size is reached or the linger time expires, whichever comes first. If batch.size (16 KB by default) is much smaller than the maximum request size, you should probably use a higher value. Also raise linger.ms to a higher value such as 10, 50, or 100 to make better use of batching and compression. This results in less flooding of the network and better compression, if compression is used.

There is no exact answer to this type of question, since it also depends on the implementation; in my case, experimenting with the values above helped. A sketch of how they might be set is shown below.
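As an illustration only (the numbers below are example values, not recommendations; the right settings depend on your message sizes and traffic), the properties discussed above could be set on the producer like this:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

public class TunedProducerFactory {
    static Producer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        props.put("acks", "all");                 // wait for all in-sync replicas
        props.put("buffer.memory", "67108864");   // 64 MB instead of the 32 MB default (example value)
        props.put("retries", "5");                // retry transient failures instead of failing immediately
        props.put("max.in.flight.requests.per.connection", "1"); // preserve ordering when retrying
        props.put("batch.size", "65536");         // 64 KB batches (example value)
        props.put("linger.ms", "50");             // wait up to 50 ms to fill a batch

        return new KafkaProducer<>(props);
    }
}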

+15

We faced a similar problem: many NetworkExceptions in the logs and, from time to time, a TimeoutException.

Cause

After we collected TCP dumps from production, it turned out that some TCP connections to the Kafka brokers (we have 3 broker nodes) were dropped without notifying the clients after about 5 minutes of idle time (no FIN flags at the TCP level). When a client tried to reuse such a connection after that time, an RST flag was returned. We could easily correlate these connection resets in the TCP dumps with the NetworkExceptions in the application logs.

As for the TimeoutException, we could not make the same correlation, because by the time we found the cause this type of error no longer occurred. However, in a separate test we confirmed that breaking a TCP connection could also lead to a TimeoutException. I assume this is because the Java Kafka client uses a Java NIO SocketChannel: messages are buffered and then sent once the connection is ready, and if the connection is not ready within the timeout (30 seconds), the messages expire, resulting in a TimeoutException.

Solution

For us, reducing connections.max.idle.ms on our clients to 4 minutes was the fix. As soon as we applied it, the NetworkExceptions disappeared from our logs.
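For reference, a minimal sketch of that change on the client side, assuming props is the producer's Properties object as in the earlier examples:

// Close idle connections before the network path (here, a NAT gateway)
// silently drops them: 4 minutes = 240000 ms, below the 350-second NAT timeout.
props.put("connections.max.idle.ms", "240000");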

We are still investigating what is breaking the connections.

Edit

The cause of the problem was an AWS NAT gateway, which drops idle outbound connections after 350 seconds.

https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway-troubleshooting.html#nat-gateway-troubleshooting-timeout

+2

Solution 1

Change

 listeners=PLAINTEXT://hostname:9092 

in the server.properties file to

 listeners=PLAINTEXT://0.0.0.0:9092 

Solution 2

Change the broker.id value to 1001 by setting the environment variable KAFKA_BROKER_ID.

You will also need to set the environment variable KAFKA_RESERVED_BROKER_MAX_ID to about 1001 in order to be allowed to set the broker ID to 1001.
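Assuming a Dockerized broker whose image maps KAFKA_*-prefixed environment variables onto server.properties entries (this mapping depends on the image; a plain installation would instead set broker.id and reserved.broker.max.id directly in server.properties), this could look like:

 KAFKA_BROKER_ID=1001 
 KAFKA_RESERVED_BROKER_MAX_ID=1001 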

I hope this helps

0

Increase request.timeout.ms and retries on your producer.
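In producer-configuration terms, assuming props is the producer's Properties object, that means something like this (the values are placeholders, not recommendations):

// Placeholder values; tune request.timeout.ms and retries to your environment.
props.put("request.timeout.ms", "30000");
props.put("retries", "3");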

0
