How to alert on the health of a Kubernetes cluster?

We run a hosted Kubernetes cluster on Google Cloud (GKE) and monitor it using Prometheus.

My question is similar to this one, but I would like to know which indicators matter most in a K8s cluster, and maybe

It is more of a K8s question than a Prometheus question, but I would really appreciate some hints. Please let me know if my question is unclear so I can improve it.

1 answer

etcd is the backbone of Kubernetes, so having a good set of alerts for it is important. We wrote this blog post about alerting rules for etcd and included a basic rule set at the end.
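To make the idea concrete, here is a minimal Python sketch (not the rule set from the blog post) that assembles a Prometheus alerting rule group for etcd as a plain dict. It assumes etcd is scraped under a job named `etcd` (a hypothetical label value) and uses the etcd metrics `etcd_server_has_leader` and `etcd_server_leader_changes_seen_total`; the thresholds and `for` durations are illustrative only. On GKE the control plane, including etcd, is managed by Google, so these metrics may not be scrapeable there at all.

```python
import json


def etcd_rule_group(job: str = "etcd") -> dict:
    """Build a Prometheus rule group (as a dict) with a few basic etcd alerts.

    The job label value and all thresholds are illustrative assumptions.
    """
    return {
        "name": "etcd",
        "rules": [
            {
                # A member that reports no leader cannot serve writes.
                "alert": "EtcdNoLeader",
                "expr": f'etcd_server_has_leader{{job="{job}"}} == 0',
                "for": "1m",
                "labels": {"severity": "critical"},
            },
            {
                # Frequent leader elections often point to network or disk trouble.
                "alert": "EtcdFrequentLeaderChanges",
                "expr": f'increase(etcd_server_leader_changes_seen_total{{job="{job}"}}[1h]) > 3',
                "for": "5m",
                "labels": {"severity": "warning"},
            },
            {
                # Prometheus cannot scrape the member at all.
                "alert": "EtcdMemberDown",
                "expr": f'up{{job="{job}"}} == 0',
                "for": "3m",
                "labels": {"severity": "critical"},
            },
        ],
    }


if __name__ == "__main__":
    # Rule files are normally YAML; JSON is printed only to stay dependency-free.
    print(json.dumps({"groups": [etcd_rule_group()]}, indent=2))
```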

Other important sources of Prometheus-format metrics are the kubelet and cAdvisor, the API servers, and the fairly new kube-state-metrics. Unfortunately, I do not know of any public alerting rule sets for those comparable to the one for etcd.

Typically, you first want to make sure that the components themselves work flawlessly as applications, for example:

  • Are my kubelets / API servers up? (the up metric)
  • Are their response latency and error rate within bounds?
  • Are the API servers available, and so on.
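The component checks listed above could be expressed roughly as follows. This is a minimal sketch under stated assumptions: the job names `kubelet` and `apiserver` depend entirely on your scrape configuration (hypothetical here), the metric names `apiserver_request_total` and `apiserver_request_duration_seconds_bucket` are those of recent Kubernetes releases (older releases used different names), and the 5% error ratio and 1 s p99 latency thresholds are arbitrary examples.

```python
import json


def up_alert(job: str, for_: str = "5m") -> dict:
    """Fire for every target of `job` that Prometheus cannot scrape."""
    return {
        "alert": f"{job.capitalize()}Down",
        "expr": f'up{{job="{job}"}} == 0',
        "for": for_,
        "labels": {"severity": "critical"},
    }


def apiserver_slo_alerts(max_error_ratio: float = 0.05, max_p99_seconds: float = 1.0) -> list:
    """Error-rate and latency alerts for the API server (thresholds are illustrative)."""
    # Fraction of requests answered with a 5xx status over the last 5 minutes.
    errors = (
        'sum(rate(apiserver_request_total{code=~"5.."}[5m]))'
        " / sum(rate(apiserver_request_total[5m]))"
    )
    # 99th percentile latency per verb; long-running WATCH/CONNECT requests are excluded.
    p99 = (
        "histogram_quantile(0.99, sum by (le, verb) ("
        'rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])))'
    )
    return [
        {"alert": "APIServerHighErrorRate", "expr": f"{errors} > {max_error_ratio}",
         "for": "10m", "labels": {"severity": "warning"}},
        {"alert": "APIServerHighLatency", "expr": f"{p99} > {max_p99_seconds}",
         "for": "10m", "labels": {"severity": "warning"}},
    ]


def component_rule_group(jobs=("kubelet", "apiserver")) -> dict:
    """Availability alerts for each job plus API server error/latency alerts."""
    return {
        "name": "kubernetes-components",
        "rules": [up_alert(j) for j in jobs] + apiserver_slo_alerts(),
    }


if __name__ == "__main__":
    print(json.dumps({"groups": [component_rule_group()]}, indent=2))
```

Keeping the thresholds as parameters makes it easy to tighten them once you know the normal baseline of your cluster.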

Then there is the Kubernetes business-logic side, for example:

  • Are there containers that have been stuck in a non-ready state or a crash loop for a long time?
  • Do I have enough CPU / memory in my cluster?
  • Are my deployment replica expectations fulfilled?
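The business-logic checks above map fairly directly onto kube-state-metrics series. The sketch below is a hedged illustration, not a vetted rule set: the metric names follow kube-state-metrics v2.x (older releases exposed, for example, per-resource metrics such as `..._requests_cpu_cores` instead of a `resource` label), and the 90% request ratio and `for` durations are assumptions to tune for your cluster.

```python
import json


def ksm_rule_group(max_request_ratio: float = 0.9) -> dict:
    """Alerts for workload health and capacity, based on kube-state-metrics v2.x names."""

    def alert(name: str, expr: str, for_: str = "15m") -> dict:
        return {"alert": name, "expr": expr, "for": for_, "labels": {"severity": "warning"}}

    def request_ratio(resource: str) -> str:
        # Share of allocatable node capacity already claimed by container requests.
        return (
            f'sum(kube_pod_container_resource_requests{{resource="{resource}"}})'
            f' / sum(kube_node_status_allocatable{{resource="{resource}"}})'
        )

    return {
        "name": "kubernetes-workloads",
        "rules": [
            # Containers stuck restarting.
            alert("PodCrashLooping",
                  'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1'),
            # Pods that stay not-ready.
            alert("PodNotReady", 'kube_pod_status_ready{condition="false"} == 1'),
            # Deployments whose available replicas differ from the desired count.
            alert("DeploymentReplicasMismatch",
                  "kube_deployment_spec_replicas != kube_deployment_status_replicas_available"),
            # Cluster running out of schedulable CPU / memory.
            alert("ClusterCPURequestsHigh", f"{request_ratio('cpu')} > {max_request_ratio}", "5m"),
            alert("ClusterMemoryRequestsHigh",
                  f"{request_ratio('memory')} > {max_request_ratio}", "5m"),
        ],
    }


if __name__ == "__main__":
    print(json.dumps({"groups": [ksm_rule_group()]}, indent=2))
```

Alerting on requests rather than actual usage reflects the "enough CPU / memory" question from the scheduler's point of view: new pods fail to schedule once requests exhaust allocatable capacity, even if real usage is low.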

Unfortunately, there is no ready-made solution, but if you write alerting rules that roughly cover the scope of the examples above, you should get quite far.
