etcd is the foundation of Kubernetes, so having a good set of alerts for it is important. We wrote this blog post about creating alerting rules for it and provided a basic set at the end.
Other important sources of metrics in the Prometheus format are the Kubelet and cAdvisor, the API servers, and the fairly new kube-state-metrics. For those, unfortunately, I am not aware of any public alerting rule sets like the one for etcd.
Generally, you want to make sure that the components themselves work flawlessly as applications, for example:
- Are my kubelets / API servers available? (up metric)
- Are their response latency and error rate within bounds?
- Are the API servers available, etc.?
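
As a rough sketch of the component-level checks above, Prometheus alerting rules along these lines could be a starting point. This is not a vetted rule set: the job label values (kubelet, apiserver), the group and alert names, the thresholds, and the for durations are all assumptions, and the API server metric names used here are the ones exposed by recent Kubernetes versions (older releases use different names).

```yaml
groups:
  - name: kubernetes-components   # hypothetical group name
    rules:
      # A scraped kubelet or API server target has been unreachable for 5 minutes.
      # The job names are assumptions; adjust them to your scrape configuration.
      - alert: KubeComponentDown
        expr: up{job=~"kubelet|apiserver"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"

      # More than 1% of API server requests returned a 5xx code (assumed threshold).
      - alert: KubeAPIErrorRateHigh
        expr: |
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            /
          sum(rate(apiserver_request_total[5m])) > 0.01
        for: 10m
        labels:
          severity: warning

      # 99th percentile request latency above 1 second (assumed threshold),
      # ignoring long-running WATCH and CONNECT requests.
      - alert: KubeAPILatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le, verb) (
              rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
            )
          ) > 1
        for: 10m
        labels:
          severity: warning
```

Note that an `up == 0` rule only fires for targets Prometheus still knows about; if a job disappears from service discovery entirely, an additional `absent(up{job="..."})` rule is needed to catch it.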
Then there is the aspect of Kubernetes's business logic, for example:
- Are there containers that have been stuck in a non-ready state or crash loop for a long time?
- Do I have enough CPU / memory in my cluster?
- Are my deployment replica expectations fulfilled?
Unfortunately, there is no failsafe solution, but writing alerting rules that roughly cover the scope of the examples above should get you pretty far.
fabxc