Let’s say you have a container living in Kubernetes. It has a node assigned, an IP address, and lives happily somewhere in the cluster. Everything’s perfect until... something breaks. Your application deadlocks, or gets into a state where it simply no longer responds to requests. Other services start timing out while trying to talk to it. So you bring up the dashboard, and Kubernetes seems convinced that your deployment is healthy. Why?
Well, as far as Kubernetes is concerned, your containers didn’t crash, so it’s reasonable to treat the pod as alive. If the process just died, e.g. because of an unhandled exception, it would get restarted by the orchestrator. But what we are dealing with here is what I like to call a “zombie process” - it’s not dead, but not alive either. Every request ends up as a 500 or a timeout.
Can we improve the situation somewhat?
Readiness and liveness checks
Thankfully, Kubernetes has a very flexible health-checking mechanism built in. Basically, you can define how the controller should check the health of your container and what it takes to treat it as alive. And you can do it on two levels:
- Readiness probe
When your pod is starting, k8s will wait for the readiness probe of all its containers to succeed. Only then will it consider the pod healthy and plug it into the Service (load balancer). This is meant to check whether your process is ready to start accepting requests.
- Liveness probe
This is used to check whether your application is still alive and operational. Be careful not to set this too tight - if it fails because of e.g. a temporary spike in load, your container will be restarted over and over again.
These probes become particularly useful when combined with rolling updates (the Deployment resource). Imagine you’re deploying V2 of your application, but unfortunately it fails to start because e.g. the config is messed up. If you have configured your readiness probes properly, no downtime will be introduced at all. V2 will be started “on the side”, and since it failed the readiness probe, it won’t be plugged into the load balancer - the old, working version will keep serving traffic. Awesome, isn’t it?
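As a sketch of how that fits together, here is a minimal, hypothetical Deployment manifest (the image name and endpoint are reused from the pod examples in this post; the replica counts and labels are illustrative). With maxUnavailable set to zero, old replicas stay behind the Service until the new ones pass their readiness probe:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-awesome-deployment
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove a ready pod before its replacement is ready
      maxSurge: 1         # start the new version "on the side"
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello-container
        image: hello_asp_net_core:latest
        readinessProbe:
          httpGet:
            path: /readiness
            port: 80
```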
Specifying them is as simple as appending a few lines of YAML to your container spec:
apiVersion: v1
kind: Pod
metadata:
  name: my-awesome-pod
spec:
  containers:
  - name: hello-container
    image: hello_asp_net_core:latest
    livenessProbe:
      httpGet:
        path: /liveness
        port: 80
    readinessProbe:
      httpGet:
        path: /readiness
        port: 80
This will make the controller call the /liveness and /readiness endpoints over HTTP in order to check the health of your container. If an endpoint returns a status outside the 200-399 range, or times out, the probe is considered failed.
Different types and configuration
Since specifying the failure and “liveness” criteria is crucial to service availability, it’s no wonder Kubernetes offers a lot of flexibility here. The most basic probe you can define is an HTTP call, which is an obvious choice for all REST APIs. All it takes to implement it is two endpoints for the controller to call. When a success status is received - anything from 200 up to, but not including, 400 - the check is considered successful.
There are also TCP checks built in. You specify a port, and if a socket can be opened on that port, the check is considered successful. This could be useful as e.g. a really simple check on a database. However, in the case of databases it would quickly turn out that a TCP check is not enough. After all, if you can open the socket but no queries can be performed, the database shouldn’t be treated as healthy.
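As a sketch, a TCP probe is declared like this (the port number is an assumption - here a hypothetical PostgreSQL container; the kubelet just tries to open a socket on it):

```yaml
livenessProbe:
  tcpSocket:
    port: 5432          # assumed database port; success = socket opens
  initialDelaySeconds: 10
  periodSeconds: 20
```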
This is where command-based checks kick in. Basically, they allow you to specify any custom command to be executed inside the container. Then the exit code is checked - if it’s zero, everything is fine. So, for example, you might use the mongo client and run a status query to check for readiness/liveness. Or, if you are dealing with e.g. a gRPC API, you might check its liveness this way too. Quite handy!
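A minimal sketch of such an exec probe, assuming a MongoDB container with the mongo shell available (the exact command is an illustration, not the only option - anything exiting with code zero on success works):

```yaml
readinessProbe:
  exec:
    command:            # run inside the container; exit code 0 = healthy
    - mongo
    - --eval
    - "db.adminCommand('ping')"
  periodSeconds: 15
```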
All of this is set per container, and in order to consider the pod ready/alive, all of its containers must pass the probes. There are also a lot of “knobs” allowing you to fine-tune your probes:
- initialDelaySeconds - how many seconds to wait before the first readiness/liveness check. Defaults to zero.
- periodSeconds - the interval at which the check is performed. If you set it to one, it’ll ping the application every second. Default: 10.
- timeoutSeconds - how long to wait for a response before the check is considered failed. Defaults to one.
- successThreshold - how many consecutive checks need to succeed for the probe to be considered successful again after a failure. Defaults to one. Note that for liveness probes this must be one.
- failureThreshold - how many consecutive failures before giving up. For a liveness probe, if set to 3, Kubernetes will let three checks fail and then restart the container. For a readiness probe, this specifies how many consecutive checks need to fail before the pod is considered “Unready”.
For example, our manifest could look like this:
apiVersion: v1
kind: Pod
metadata:
  name: my-awesome-pod
spec:
  containers:
  - name: hello-container
    image: hello_asp_net_core:latest
    livenessProbe:
      httpGet:
        path: /liveness
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 30
      timeoutSeconds: 2
      successThreshold: 1   # must be 1 for liveness probes
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /readiness
        port: 80
      initialDelaySeconds: 15
      periodSeconds: 15
      timeoutSeconds: 2
      successThreshold: 5
      failureThreshold: 2
Summary
Today we’ve learnt how the orchestrator checks the liveness and readiness of containers. We got to know the parameters that control the behavior of those checks, and we briefly went through the different kinds of checks Kubernetes can perform out of the box. Setting up these checks will make your service more resilient and can let you sleep better, as things will restart themselves when necessary.
In the next post, we are going to focus on how to split resources between your pods while ensuring that critical services have resources reserved. Stay tuned!