I run a write-heavy program (10 peak peaks when recording 25 fps) on a 24-node Cassandra 3.5 cluster on AWS EC2 (each host is a c4.2xlarge: 8 vCPUs and 15 GB of RAM).
From time to time, my Java client, which uses the DataStax driver 3.0.2, fails with a write timeout:
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency TWO (2 replica were required but only 1 acknowledged the write)
    at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:73)
    at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:26)
    at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
    at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:64)
The error occurs infrequently and very unpredictably. So far, I can't correlate the errors with anything specific (for example, the program's uptime, on-disk data size, time of day, or system load indicators such as CPU, memory, and network). However, it does disrupt our operations.
I am trying to find the root cause of the problem. Searching the Internet, I am a little overwhelmed by all the suggestions, such as:
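To help correlate the failures, one option is to pull the consistency level and the required/acknowledged counts out of each logged message. Below is a minimal sketch, assuming the message always has the format shown in the stack trace above; the class name `WriteTimeoutInfo` and its fields are hypothetical and not part of the driver.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the counts from a logged write-timeout message. */
final class WriteTimeoutInfo {
    // Assumption: the message follows the format seen in the stack trace.
    private static final Pattern MESSAGE = Pattern.compile(
        "timeout during write query at consistency (\\w+) "
        + "\\((\\d+) replica were required but only (\\d+) acknowledged the write\\)");

    final String consistency;
    final int required;
    final int acknowledged;

    private WriteTimeoutInfo(String consistency, int required, int acknowledged) {
        this.consistency = consistency;
        this.required = required;
        this.acknowledged = acknowledged;
    }

    /** Returns the parsed values, or empty if the line is not a write-timeout message. */
    static Optional<WriteTimeoutInfo> parse(String logLine) {
        Matcher m = MESSAGE.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new WriteTimeoutInfo(
            m.group(1), Integer.parseInt(m.group(2)), Integer.parseInt(m.group(3))));
    }

    public static void main(String[] args) {
        String line = "com.datastax.driver.core.exceptions.WriteTimeoutException: "
            + "Cassandra timeout during write query at consistency TWO "
            + "(2 replica were required but only 1 acknowledged the write)";
        parse(line).ifPresent(i -> System.out.println(
            i.consistency + " required=" + i.required + " acknowledged=" + i.acknowledged));
    }
}
```

Logging these three values next to a timestamp for every failure would at least show whether the errors always report the same consistency level and the same shortfall.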
- Increase "write_request_timeout_in_ms" in "cassandra.yaml" (already raised to 5 seconds)
- Use a suitable "RetryPolicy" to retry the request (already using DowngradingConsistencyRetryPolicy, at consistency level ONE)
- Change cache size, heap size, etc. I have not tried these because there are good reasons to rule them out as the root cause.
What really confuses me is that I get this error from a fully replicated cluster that reports very few ClientRequest.Write.Timeouts events:
- I have a fully replicated 24-node cluster spanning 5 AWS regions. Each region holds at least 2 copies of the data.
- My program uses consistency level ONE at the session level (set through the Cluster builder with QueryOptions)
- When the error occurs, our metrics chart shows write timeouts (Cassandra.ClientRequest.Write.Timeouts.Count) on no more than three (3) hosts
- I have already set write_request_timeout_in_ms to 5 seconds. The network is fast (checked with iperf3) and stable
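To make the arithmetic behind the error message explicit, here is a simplified model of how I understand the coordinator decides a write has timed out: it waits for a number of replica acknowledgements determined by the consistency level, and fails if fewer arrive within the write timeout. This is only a sketch under that assumption. The class `AckModel`, the replication factor of 10 (5 regions with 2 copies each), and the latency values are hypothetical, and datacenter-local levels are left out.

```java
/** Hypothetical, simplified model of acknowledgement counting for a write. */
final class AckModel {
    /** Acknowledgements needed for a consistency level, given replication factor rf. */
    static int requiredAcks(String consistency, int rf) {
        switch (consistency) {
            case "ONE":    return 1;
            case "TWO":    return 2;
            case "THREE":  return 3;
            case "QUORUM": return rf / 2 + 1;
            case "ALL":    return rf;
            default:
                throw new IllegalArgumentException("level not modelled: " + consistency);
        }
    }

    /** True if fewer replicas than required answer within timeoutMs. */
    static boolean timesOut(String consistency, int rf, long[] replicaLatenciesMs, long timeoutMs) {
        int acks = 0;
        for (long latency : replicaLatenciesMs) {
            if (latency <= timeoutMs) {
                acks++;
            }
        }
        return acks < requiredAcks(consistency, rf);
    }

    public static void main(String[] args) {
        // Made-up latencies: one replica answers quickly, nine miss the 5 s timeout.
        long[] latenciesMs = {20, 6000, 6000, 6000, 6000, 6000, 6000, 6000, 6000, 6000};
        System.out.println("ONE times out: " + timesOut("ONE", 10, latenciesMs, 5000));
        System.out.println("TWO times out: " + timesOut("TWO", 10, latenciesMs, 5000));
    }
}
```

Under this model, "2 replica were required but only 1 acknowledged" means that all replicas but one failed to answer within 5 seconds, which is hard to reconcile with only three hosts reporting timeouts.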
On paper, this situation should be well within Cassandra's fault tolerance. So why does my program still fail? Are the numbers not what they seem?