In this article, we will learn how Kubernetes Liveness Probes and Kubernetes Readiness Probes help us improve the continuity and availability of Kubernetes Services. For that, we will learn how Kubernetes uses probes to decide whether an application is up and running and acts accordingly:

  • liveness probes – check whether a long-running application is still responsive
  • readiness probes – check whether a starting application is responsive already (a minimal configuration sketch follows right after this list)
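
Both probe types are configured per container in the POD specification. Below is a minimal sketch with example values; the full, runnable manifests follow in the chapters below:

containers:
- name: web
  image: nginx
  livenessProbe:            # restart the container if this check keeps failing
    httpGet:
      path: /
      port: 80
    initialDelaySeconds: 5
    periodSeconds: 5
  readinessProbe:           # only route Service traffic to the POD once this check succeeds
    httpGet:
      path: /
      port: 80
    initialDelaySeconds: 5
    periodSeconds: 5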

0. Preparation

Step 0.1: Access the Kubernetes Playground

As always, we start by accessing the Katacoda Kubernetes Playground.

Step 0.2 (optional): Configure auto-completion

The Katacoda Kubernetes Playground has the alias and auto-completion defined already. Only in case you are running your tests in another environment, we recommend issuing the following two commands:

alias k=kubectl
source <(kubectl completion bash)

However, even in the case of the Katacoda Kubernetes Playground, auto-completion does not work for the alias k yet. Therefore, we need to type the following command:

source <(kubectl completion bash | sed 's/kubectl/k/g')

Once this is done, k g<tab> will be auto-completed to k get and k get pod <tab> will reveal the name(s) of the available POD(s).

1. Liveness Probes

Liveness probes help to detect PODs/containers that are not fully functional. Once the application is not responsive anymore, Kubernetes can restart the corresponding POD and thus potentially auto-heal the problem.
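
As a side note, you can query the documentation of all available probe fields directly from the cluster:

kubectl explain pod.spec.containers.livenessProbe

# prints the description and the available fields (httpGet, exec, tcpSocket, initialDelaySeconds, periodSeconds, failureThreshold, ...)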

We will now test what happens if a POD of a Kubernetes Deployment becomes unresponsive after some time. For that, we will start a Deployment with two parallel processes: one is an NginX server and the other is a while loop that keeps the container up and running even if we manually kill the NginX process, thus making the container unresponsive.

Kubernetes Liveness Probe -- Kubelet restarting a non-responsive POD

Step 1.1: Create Deployment with two Processes and a Service

Let us create a Deployment with a shell script that starts an NginX server in the background and a while loop that keeps the container up and running even if the NginX service fails. We also create a Service, so we can verify whether the NginX server is responsive or not. The liveness probe is commented out for now:

cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  # modify replicas according to your case
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        command: ["/bin/sh"] 
        args:
        - -c
        - >
          nginx -g 'daemon off;' & while true; do sleep 10; done
#        livenessProbe:
#          httpGet:
#            path: /
#            port: 80
#          initialDelaySeconds: 5
#          periodSeconds: 5
EOF

# output:
# service/nginx created
# deployment.apps/nginx created

After 60 sec, you should be able to access the NginX Server on the service endpoint:

kubectl get svc
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
kubernetes   ClusterIP      10.96.0.1        <none>        443/TCP        64m
nginx        LoadBalancer   10.111.191.121   172.17.0.75   80:32244/TCP   10m

CLUSTER_IP=$(kubectl get svc | grep nginx | awk '{print $3}')
curl -v -s $CLUSTER_IP 2>&1 | head -n 10
* Rebuilt URL to: 10.111.191.121/
*   Trying 10.111.191.121...
* Connected to 10.111.191.121 (10.111.191.121) port 80 (#0)
> GET / HTTP/1.1
> Host: 10.111.191.121
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx/1.17.5

If you got the following message, wait up to 60 sec and try again:

connect to 10.96.228.235 port 80 failed: Connection refused

If it does not work after 2 or 3 minutes, then search for the POD name and look at the logs with kubectl logs <POD>.
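
For example (a sketch; the label app=nginx comes from the manifest above):

POD=$(kubectl get pod -l app=nginx -o jsonpath='{.items[0].metadata.name}')
kubectl logs $POD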

Above, we have received a „200 OK“ from the NginX server. You can also view the HTML message by omitting the pipe to the head command, if you wish.
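
For example, you can check the title of the default NginX welcome page (the default nginx image serves a page with this title):

curl -s $CLUSTER_IP | grep title

# possible output:
# <title>Welcome to nginx!</title>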

Step 1.2: Kill the NginX Server Process

Now let us kill the NginX process to simulate a broken POD that still lives in the sense that the container is up and running. For that, let us find the process ID. The ps command is not available in the container, so we would need to install it, or we can look at the /proc folder (we have chosen the latter):

POD=$(k get pod | grep nginx | awk '{print $1}')
kubectl exec $POD -- bash -c 'find /proc -mindepth 2 -maxdepth 2 -name exe -exec ls -lh {} \; 2>/dev/null'

# output:
...
lrwxrwxrwx 1 root root 0 Nov  6 06:26 /proc/6/exe -> /bin/nginx
...

Here, we can see that the process ID of the nginx process is 6. Let us kill the process now:

kubectl exec $POD -- bash -c 'kill 6'
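
If the process ID differs in your environment, you can extract it from the find output instead of hard-coding it (a sketch; it picks the first matching process, which is usually the nginx master):

NGINX_PID=$(kubectl exec $POD -- bash -c 'find /proc -mindepth 2 -maxdepth 2 -name exe -exec ls -lh {} \; 2>/dev/null' \
  | grep nginx | head -n 1 | sed 's|.*/proc/\([0-9]*\)/exe.*|\1|')
kubectl exec $POD -- bash -c "kill $NGINX_PID"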

Now the service is unresponsive:

curl -s -v $CLUSTER_IP 2>&1 | head -n 10
* Rebuilt URL to: 10.96.228.235/
*   Trying 10.96.228.235...
* connect to 10.96.228.235 port 80 failed: Connection refused
* Failed to connect to 10.96.228.235 port 80: Connection refused
* Closing connection 0

However, the POD is not restarted. It is still the old POD:

kubectl get pod
NAME                    READY   STATUS    RESTARTS   AGE
nginx-7b7698b7b-q4ctz   1/1     Running   0          30m

Thus, the NginX server has stopped working properly, but there is no self-healing either. Let us fix that now.

Step 1.3: Add a Liveness Probe

We re-deploy the Kubernetes Deployment with the liveness probe configured:

cat <<EOF | kubectl apply -f -
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  # modify replicas according to your case
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        command: ["/bin/sh"] 
        args:
        - -c
        - >
          nginx -g 'daemon off;' & while true; do sleep 10; done
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
EOF

# output:
# deployment.apps/nginx configured

The POD has been replaced and the NginX server is responsive again (after a minute or so):

kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5b4955cf69-dkc6f   1/1     Running   0          38s

curl -v -s $CLUSTER_IP 2>&1 | head -n 10
* Rebuilt URL to: 10.96.228.235/
*   Trying 10.96.228.235...
* Connected to 10.96.228.235 (10.96.228.235) port 80 (#0)
> GET / HTTP/1.1
> Host: 10.96.228.235
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx/1.17.5
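
You can also verify that the liveness probe is now part of the POD template; the describe output format may differ slightly between Kubernetes versions:

kubectl describe deploy nginx | grep -i liveness

# possible output:
#     Liveness:  http-get http://:80/ delay=5s timeout=1s period=5s #success=1 #failure=3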

Step 1.4: Kill the NginX Process

Since the old POD has been replaced by a new one in the previous step, we refresh the POD variable and then kill the nginx process again:

POD=$(kubectl get pod | grep nginx | awk '{print $1}')
kubectl exec $POD -- bash -c 'find /proc -mindepth 2 -maxdepth 2 -name exe -exec ls -lh {} \; 2>/dev/null' | grep nginx
lrwxrwxrwx 1 root root 0 Nov  6 06:53 /proc/6/exe -> /usr/sbin/nginx
lrwxrwxrwx 1 nginx nginx 0 Nov  6 06:53 /proc/8/exe

kubectl exec $POD -- bash -c 'kill 6'

Now the NginX server is unresponsive again:

curl -s -v $CLUSTER_IP 2>&1 | head -n 10
* Rebuilt URL to: 10.96.228.235/
*   Trying 10.96.228.235...
* connect to 10.96.228.235 port 80 failed: Connection refused
* Failed to connect to 10.96.228.235 port 80: Connection refused
* Closing connection 0

Step 1.5: Wait for the Service to heal itself

However, after a minute or so, you will see that the POD's container has been restarted: the liveness probe has failed, so the kubelet has decided that the container is broken and restarts it:

kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5b4955cf69-dkc6f   1/1     Running   0          5m15s

kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5b4955cf69-dkc6f   1/1     Running   1          6m
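
You can also inspect the POD's events to see why the container was restarted; the exact wording depends on the Kubernetes version:

kubectl describe pod $POD | grep -i -e unhealthy -e killing

# possible output (excerpt):
#   Warning  Unhealthy  ...  Liveness probe failed: Get http://10.40.0.4:80/: dial tcp 10.40.0.4:80: connect: connection refused
#   Normal   Killing    ...  Container nginx failed liveness probe, will be restarted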

After this, the NginX server is responsive again:

curl -s -v $CLUSTER_IP 2>&1 | head -n 10
* Rebuilt URL to: 10.96.228.235/
*   Trying 10.96.228.235...
* Connected to 10.96.228.235 (10.96.228.235) port 80 (#0)
> GET / HTTP/1.1
> Host: 10.96.228.235
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx/1.17.5

With the help of the Liveness Probe, the problem has healed itself. Any POD that does not respond properly is restarted automatically.

2. Readiness Probes

Readiness probes, in turn, are used to detect when a container has finished booting the application and is ready to serve requests. Without a readiness probe, the POD’s endpoint might be added to a load-balancer too early, thus degrading the service. Moreover, during the rollout of a non-functional new version of a ReplicaSet or Deployment, the rollout stops as soon as Kubernetes detects that the first new POD never becomes ready. With that, the service is not degraded, giving the administrator a chance to mitigate the problem.
Kubernetes Readiness Probes -- if answered correctly, the POD is added to the list of Service endpoints

Step 2.1: Explore Problems with Containers that have a long Boot-Time

In this sub-chapter, we want to see what happens to an exposed Service if a POD/container starts with a long boot time. For this, we create a Service with two PODs, one of which has a boot time of 60 sec. We simulate this situation by running a 60-second sleep command before we run the actual application, i.e. an NginX server.

cat <<EOF | kubectl apply -f - 
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: nginx
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-slowboot
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    command: ["/bin/sh"]
    args: ["-c", "sleep 60; nginx -g 'daemon off;'"]
#    readinessProbe:
#      httpGet:
#        path: /
#        port: 80
#      initialDelaySeconds: 5
#      periodSeconds: 5
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-quickboot
  labels:
    app: nginx
spec:
  containers:
  - name: nginx-quickboot
    image: nginx
#    readinessProbe:
#      httpGet:
#        path: /
#        port: 80
#      initialDelaySeconds: 5
#      periodSeconds: 5
EOF

For now, we have commented out the readiness probe.

Right after the start of the service and the PODs, we copy and paste the following code to the master:

CLUSTER_IP=$(kubectl get svc | grep nginx | awk '{print $3}')
sleep 15
kubectl get ep
while true; do curl -s $CLUSTER_IP -o /dev/null && echo success || echo failure; sleep 2; done

The output will look similar to the following:

# output:
master $ kubectl get ep
NAME         ENDPOINTS                   AGE
kubernetes   172.17.0.25:6443            117m
nginx        10.32.0.2:80,10.32.0.3:80   15s
master $ while true; do curl -s $CLUSTER_IP -o /dev/null && echo success || echo failure; sleep 2; done
failure
success
failure
failure
failure
failure
success
success

Therefore, statistically, roughly every second curl request fails. This is because the HTTP requests are load-balanced between the POD that is already working and the one that is still in the boot process, leading to the „failure“ output above. The problem is caused by both endpoints being added to the Service right away, even though one of the PODs is not responsive yet:

kubectl get ep
NAME         ENDPOINTS                   AGE
nginx        10.40.0.2:80,10.40.0.3:80   35s
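
To see which endpoint belongs to which POD, you can compare the POD IPs with the endpoint list (the IP assignments will differ in your environment):

kubectl get pod -o wide | grep nginx

# possible output (excerpt):
# nginx-quickboot   1/1   Running   0   35s   10.40.0.3   node01   ...
# nginx-slowboot    1/1   Running   0   35s   10.40.0.2   node01   ...
# note: without a readiness probe, even the still-sleeping POD is reported as READY 1/1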

If you wait long enough (> 2 minutes), all curl commands will be successful, indicating that the slower nginx POD is responsive as well.

In the next step, we will improve the initialization procedure by adding a readiness probe.

Step 2.2: Add a Readiness Probe

We now will remove the slow POD:

kubectl delete pod nginx-slowboot

We now re-create the POD with a readiness probe:

cat <<EOF | kubectl apply -f - 
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-slowboot
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    command: ["/bin/sh"]
    args: ["-c", "sleep 60; nginx -g 'daemon off;'"]
    readinessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
EOF

Again, we copy & paste the following code to the master’s console:

sleep 15
kubectl get ep
while true; do curl -s $CLUSTER_IP -o /dev/null && echo success || echo failure; sleep 2; done

This time, the output will look much better than before:

master $ sleep 15
master $ kubectl get ep
NAME         ENDPOINTS          AGE
kubernetes   172.17.0.25:6443   120m
nginx        10.32.0.3:80       3m8s
master $ while true; do curl -s $CLUSTER_IP -o /dev/null && echo success || echo failure; sleep 2; done
success
success
success
success
success
success

All curl requests are successful. The reason is that the endpoint of the slowly booting POD is not added to the list of Service endpoints before the POD successfully replies to the readiness probe. This way, you will never create a black hole for the HTTP requests.
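
You can check the readiness condition of the slowly booting POD directly (a sketch using jsonpath):

kubectl get pod nginx-slowboot -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'; echo

# possible output while the POD is still sleeping:
# False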

If you wait for more than 60 seconds, stop the while loop with <ctrl>-C, and look at the list of endpoints again, you will see that the second endpoint has been added as well:

kubectl get ep
NAME         ENDPOINTS                   AGE
kubernetes   172.17.0.25:6443            122m
nginx        10.32.0.2:80,10.32.0.3:80   5m16s
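
As an alternative to polling with kubectl get ep, you can also watch the endpoint list while the slow POD boots and see the second endpoint appear (stop the watch with <ctrl>-C):

kubectl get ep nginx -w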

3. ReplicaSets and Readiness Probes

We will now show how a readiness probe can help in the case of a rollout of a misconfigured ReplicaSet or Deployment.

Step 3.1: Create a working ReplicaSet with Readiness Probe

cat <<EOF | kubectl apply -f -
---
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  # modify replicas according to your case
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        command: ["/bin/sh"] 
        args: ["-c", "nginx -g 'daemon off;'"]
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
EOF

After 60 sec or so, the PODs are considered to be up and running:

kubectl get pods
NAME          READY   STATUS    RESTARTS   AGE
nginx-557fr   1/1     Running   0          62s
nginx-j6zsb   1/1     Running   0          62s
nginx-q682c   1/1     Running   0          62s

Now we do the same, but we make sure that nginx never starts, by adding a long-running sleep command instead of the nginx command:

cat <<EOF | kubectl apply -f -
---
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx-not-starting
  labels:
    app: nginx-not-starting
spec:
  # modify replicas according to your case
  replicas: 3
  selector:
    matchLabels:
      app: nginx-not-starting
  template:
    metadata:
      labels:
        app: nginx-not-starting
    spec:
      containers:
      - name: nginx
        image: nginx
        command: ["/bin/sh"] 
        args: ["-c", "sleep 3600"]
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
EOF

Even after more than two minutes, the PODs are still not considered to be READY, even though the STATUS is „Running“:

kubectl get pods
NAME                       READY   STATUS    RESTARTS   AGE
nginx-557fr                1/1     Running   0          6m54s
nginx-j6zsb                1/1     Running   0          6m54s
nginx-not-starting-dbnwh   0/1     Running   0          2m14s
nginx-not-starting-mkg7l   0/1     Running   0          2m14s
nginx-not-starting-wpqfw   0/1     Running   0          2m14s
nginx-q682c                1/1     Running   0          6m54s
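
You can confirm why these PODs never become READY by looking at the events of one of them; the exact messages depend on the Kubernetes version:

POD=$(kubectl get pod | grep nginx-not-starting | head -n 1 | awk '{print $1}')
kubectl describe pod $POD | grep -i readiness

# possible output (excerpt):
#   Readiness:  http-get http://:80/ delay=5s timeout=1s period=5s #success=1 #failure=3
#   Warning  Unhealthy  ...  Readiness probe failed: Get http://10.40.0.5:80/: dial tcp 10.40.0.5:80: connect: connection refused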

In summary, we can see that the PODs of a non-responsive ReplicaSet with readiness probes never become READY, even if their STATUS is „Running“.

How can we make use of the fact that non-working ReplicaSets never become „READY“ when using Deployments? This will be investigated in the next chapter.

4. Deployments and Readiness Probes

In this sub-chapter, we will show how we can make use of the fact that non-responsive ReplicaSets never enter the „READY“ state.

Step 4.1: Create a working Deployment

We restart the Kubernetes Playground, so it will be clean again:

https://www.katacoda.com/courses/kubernetes/playground#

We start a Deployment with a working NginX application:

cat <<EOF | kubectl apply -f -
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deploy
  labels:
    app: nginx-deploy
spec:
  # modify replicas according to your case
  replicas: 3
  selector:
    matchLabels:
      app: nginx-deploy
  template:
    metadata:
      labels:
        app: nginx-deploy
    spec:
      containers:
      - name: nginx
        image: nginx
        command: ["/bin/sh"] 
        args: ["-c", "nginx -g 'daemon off;'"]
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
EOF

After a minute or so, the PODs are READY:

kubectl get pod
NAME                            READY   STATUS    RESTARTS   AGE
nginx-deploy-687c989cb9-8pg6m   1/1     Running   0          23s
nginx-deploy-687c989cb9-g2d97   1/1     Running   0          23s
nginx-deploy-687c989cb9-psdst   1/1     Running   0          23s

Step 4.2: Try to break the Deployment

Now let us edit the deployment and replace the command:

kubectl edit deploy nginx-deploy
...
containers:
- args:
  - -c
  - nginx -g 'daemon off;'   # <----------- replace <nginx ...> by <sleep 3600>
# output after saving: deployment.extensions/nginx-deploy edited

Here it is in pictures. Before the edit, we see:

POD with nginx command

Afterward, it looks as follows:

POD with sleep command

Step 4.3: Observe that the old Deployment PODs are kept up and running

Now, a new POD is started:

kubectl get pod
NAME                            READY   STATUS    RESTARTS   AGE
nginx-deploy-59d6f9bcc5-rcfz5   0/1     Running   0          47s
nginx-deploy-687c989cb9-8pg6m   1/1     Running   0          6m31s
nginx-deploy-687c989cb9-g2d97   1/1     Running   0          6m31s
nginx-deploy-687c989cb9-psdst   1/1     Running   0          6m31s

However, the new POD will never be READY, so the old, working PODs will not be removed in a rolling update. Two minutes later, nothing has changed:

kubectl get pod
NAME                            READY   STATUS    RESTARTS   AGE
nginx-deploy-59d6f9bcc5-rcfz5   0/1     Running   0          2m12s
nginx-deploy-687c989cb9-8pg6m   1/1     Running   0          7m56s
nginx-deploy-687c989cb9-g2d97   1/1     Running   0          7m56s
nginx-deploy-687c989cb9-psdst   1/1     Running   0          7m56s

Moreover, if there is a service that exposes the deployment, the traffic will only be distributed among the old, working PODs.

Thus, the existing Service is not disrupted, and the administrator has time to troubleshoot and correct the problem that was introduced with the new Deployment version.
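
If you want to observe or revert such a stuck rollout, the kubectl rollout commands help (the exact messages may vary by Kubernetes version):

kubectl rollout status deploy nginx-deploy
# possible output (the command keeps waiting as long as the new POD is not READY):
# Waiting for deployment "nginx-deploy" rollout to finish: 1 out of 3 new replicas have been updated...

# revert to the previous, working ReplicaSet:
kubectl rollout undo deploy nginx-deploy
# possible output: deployment.extensions/nginx-deploy rolled back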

Summary

In this article, we have learned how to use Kubernetes Readiness and Liveness Probes to mitigate application problems during startup and normal operation.

Kubernetes Readiness probes make sure that applications that are not yet ready do not disturb any existing Services. They also prevent misconfigured, non-responsive PODs from replacing old, still functional PODs in production.

Kubernetes Liveness probes monitor long-running application PODs and restart those if they become unresponsive.
