In this blog post, we will have a close look at Kubernetes labels and node selectors. Labels and node selectors are used to control which PODs are scheduled on which set of nodes.

We will go into the details of what happens to PODs with node selectors that do not match any node. Moreover, we will investigate what happens to already running PODs if a label is removed from a node.

Scheduler's Reaction to Kubernetes Labels and Node Selectors

tl;dr

Impatient readers may be interested in the Summary section below.

Step 0: Preparations

Step 0.1: Access the Kubernetes Playground

As always, we start by accessing the Katacoda Kubernetes Playground. We press launch.sh and type the following commands, so we can use auto-completion together with the predefined alias k for kubectl:

source <(kubectl completion bash) # optional, because katacoda has done that for us already

alias k=kubectl # optional, because katacoda has done that for us already
complete -F __start_kubectl k

As an example, k g<tab> will be completed to k get. Also nice: k get pod <tab> will reveal the name of the POD(s).

Note: the last command, which is found in the official Kubernetes documentation, can also be replaced by the command we had used in the previous blog posts:

source <(kubectl completion bash | sed 's/kubectl/k/g')

Step 0.2: Schedule PODs also on the Master

The Katacoda k8s playground provides us with a single master and a worker node. However, we want to study the effect of labels, taints, and tolerations, and for that, we need at least two nodes that can run PODs. To achieve that without adding a real worker node, we make sure that PODs can be deployed on the master as well. Therefore, we need to remove the NoSchedule taint from the master:

# check that both nodes are available:
kubectl get nodes
NAME     STATUS   ROLES    AGE   VERSION
master   Ready    master   31m   v1.14.0
node01   Ready    <none>   31m   v1.14.0

# check the taints before removal:
kubectl describe nodes | egrep "Name:|Taints:"
Name:               master
Taints:             node-role.kubernetes.io/master:NoSchedule
Name:               node01
Taints:             <none>

# remove taint:
kubectl taint node master node-role.kubernetes.io/master-
node/master untainted

# check the taints after removal:
kubectl describe nodes | egrep "Name:|Taints:"
Name:               master
Taints:             <none>
Name:               node01
Taints:             <none>
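
Note: in case you want to restore the default behavior later, the taint can be re-added. A minimal sketch; the empty taint value reproduces the taint we have just removed:

kubectl taint node master node-role.kubernetes.io/master=:NoSchedule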

Task 1: Run PODs on dedicated Nodes

In this phase, we show how we can use the POD's nodeSelector to control on which Kubernetes node the POD is running.

Step 1.1: Check Labels of Nodes

As a first step, we display the labels that are assigned to the nodes:

kubectl get nodes --show-labels
NAME     STATUS   ROLES    AGE   VERSION   LABELS
master   Ready    master   32m   v1.14.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master,kubernetes.io/os=linux,node-role.kubernetes.io/master=
node01   Ready    <none>   32m   v1.14.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node01,kubernetes.io/os=linux

On rare occasions, we have seen that node01 is missing from the output. If this is the case, you need to restart the Katacoda session.

Step 1.2: Add Labels

We now assign the label vip=true to the master and vip=false to node01:

kubectl label nodes master vip=true --overwrite
kubectl label nodes node01 vip=false --overwrite

Note: the --overwrite option is only needed if the vip label is already defined. However, the option does not hurt otherwise.
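
If you are curious what happens without the overwrite option when the label already exists, you can try the following (the exact error wording may vary with the kubectl version):

kubectl label nodes master vip=false

# expected error, since vip is already set to "true":
# error: 'vip' already has a value (true), and --overwrite is false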

Step 1.3: Verify Labels


kubectl get nodes --show-labels
NAME     STATUS   ROLES    AGE   VERSION   LABELS
master   Ready    master   32m   v1.14.0   ...,vip=true
node01   Ready    <none>   32m   v1.14.0   ...,vip=false
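
Alternatively, the -L option of kubectl get nodes prints a dedicated column for a given label key, which is handy when the full label list becomes long. A quick sketch:

kubectl get nodes -L vip

# expected: an extra VIP column showing true for master and false for node01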

Step 1.4: Run PODs with a Node Selector

Now let us create four PODs with the nodeSelector vip: "true" in the POD spec:

cat <<EOF | kubectl create -f -
---
apiVersion: v1
kind: Pod
metadata:
  name: vip1
spec:
  containers:
  - name: sleep
    image: busybox
    args:
    - sleep
    - "1000000"
  nodeSelector:
    vip: "true"

---
apiVersion: v1
kind: Pod
metadata:
  name: vip2
spec:
  containers:
  - name: sleep
    image: busybox
    args:
    - sleep
    - "1000000"
  nodeSelector:
    vip: "true"

---
apiVersion: v1
kind: Pod
metadata:
  name: vip3
spec:
  containers:
  - name: sleep
    image: busybox
    args:
    - sleep
    - "1000000"
  nodeSelector:
    vip: "true"

---
apiVersion: v1
kind: Pod
metadata:
  name: vip4
spec:
  containers:
  - name: sleep
    image: busybox
    args:
    - sleep
    - "1000000"
  nodeSelector:
    vip: "true"

EOF

# output:
pod/vip1 created
pod/vip2 created
pod/vip3 created
pod/vip4 created

We now check the location of the PODs with the kubectl get pod command and the -o=wide option:

kubectl get pod -o=wide
NAME   READY   STATUS    RESTARTS   AGE     IP          NODE     NOMINATED NODE   READINESS GATES
vip1   1/1     Running   0          6m25s   10.32.0.3   master   <none>           <none>
vip2   1/1     Running   0          6m25s   10.32.0.5   master   <none>           <none>
vip3   1/1     Running   0          6m25s   10.32.0.2   master   <none>           <none>
vip4   1/1     Running   0          6m25s   10.32.0.4   master   <none>           <none>

That means that all four PODs with nodeSelector vip: "true" have been started on the node labeled with vip=true, i.e. the master.
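
If you want to list only the PODs of a specific node, a field selector on spec.nodeName can be used. A quick sketch; with the current labels, all four PODs should appear for the master and none for node01:

kubectl get pod -o wide --field-selector spec.nodeName=master
kubectl get pod -o wide --field-selector spec.nodeName=node01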

Optionally, we now can clean the PODs from the cluster again:

kubectl delete pod vip1 vip2 vip3 vip4

# output:
pod "vip1" deleted
pod "vip2" deleted
pod "vip3" deleted
pod "vip4" deleted

Summary of Task 1

In this task, we have shown that the nodeSelector vip: "true" can be used to control the location of the scheduled PODs. All PODs with this nodeSelector can run only on nodes that are labeled with vip=true.

Task 2: Run PODs on any Node

In this phase, we investigate how the PODs are distributed among the nodes when no nodeSelector is set on the PODs. We expect a more or less even distribution of the PODs among the nodes.

Step 2.0: Check Nodes and un-taint the Master

We have done this already in Task 1, but if you happen to run Task 2 independently of Task 1, you need to repeat the following commands:

# optional
source <(kubectl completion bash)
source <(kubectl completion bash | sed 's/kubectl/k/g')
kubectl get nodes

# required
kubectl taint node master node-role.kubernetes.io/master-

Step 2.1: Run PODs with no Node Selector

We now create four PODs without any nodeSelector in the POD spec:

cat <<EOF | kubectl create -f -
---
apiVersion: v1
kind: Pod
metadata:
  name: no-vip-01             # <-------- changed
spec:
  containers:
  - name: sleep
    image: busybox
    args:
    - sleep
    - "1000000"
#  nodeSelector:              # <-------- commented out
#    vip: "true"

---
apiVersion: v1
kind: Pod
metadata:
  name: no-vip-02             # <-------- changed
spec:
  containers:
  - name: sleep
    image: busybox
    args:
    - sleep
    - "1000000"
#  nodeSelector:              # <-------- commented out
#    vip: "true"

---
apiVersion: v1
kind: Pod
metadata:
  name: no-vip-03             # <-------- changed
spec:
  containers:
  - name: sleep
    image: busybox
    args:
    - sleep
    - "1000000"
#  nodeSelector:              # <-------- commented out
#    vip: "true"

---
apiVersion: v1
kind: Pod
metadata:
  name: no-vip-04             # <-------- changed
spec:
  containers:
  - name: sleep
    image: busybox
    args:
    - sleep
    - "1000000"
#  nodeSelector:              # <-------- commented out
#    vip: "true"

EOF

# output:
pod/no-vip-01 created
pod/no-vip-02 created
pod/no-vip-03 created
pod/no-vip-04 created

We now check the location of the PODs:

kubectl get pod -o wide
NAME        READY   STATUS    RESTARTS   AGE   IP          NODE     NOMINATED NODE   READINESS GATES
no-vip-01   1/1     Running   0          22s   10.40.0.2   node01   <none>           <none>
no-vip-02   1/1     Running   0          22s   10.40.0.3   node01   <none>           <none>
no-vip-03   1/1     Running   0          22s   10.40.0.1   node01   <none>           <none>
no-vip-04   1/1     Running   0          22s   10.40.0.4   node01   <none>           <none>

This is not what we had expected: all four PODs have been started on node01.

However, we can prove that only a certain number of PODs is started on node01 before PODs are also started on the master. We do that with the following script, which you can just cut&paste into the terminal of the master:

# createNonVipPod <NN>: create a POD named no-vip-<NN> without any nodeSelector
createNonVipPod() {
cat <<EOF | kubectl create -f -
---
apiVersion: v1
kind: Pod
metadata:
  name: no-vip-$1
spec:
  containers:
  - name: sleep
    image: busybox
    args:
    - sleep
    - "1000000"
EOF
}

for i in $(seq -f '%02g' 5 29); do
  createNonVipPod $i
done

We expect the following output:

master $ createNonVipPod() {
> cat <<EOF | kubectl create -f -
> ---
> apiVersion: v1
> kind: Pod
> metadata:
>   name: no-vip-$1
> spec:
>   containers:
>   - name: sleep
>     image: busybox
>     args:
>     - sleep
>     - "1000000"
> EOF
> }
master $
master $ for i in $(seq -f '%02g' 5 29); do
>   createNonVipPod $i
> done
pod/no-vip-05 created
pod/no-vip-06 created
pod/no-vip-07 created
pod/no-vip-08 created
pod/no-vip-09 created
pod/no-vip-10 created
pod/no-vip-11 created
pod/no-vip-12 created
pod/no-vip-13 created
pod/no-vip-14 created
pod/no-vip-15 created
pod/no-vip-16 created
pod/no-vip-17 created
pod/no-vip-18 created
pod/no-vip-19 created
pod/no-vip-20 created
pod/no-vip-21 created
pod/no-vip-22 created
pod/no-vip-23 created
pod/no-vip-24 created
pod/no-vip-25 created
pod/no-vip-26 created
pod/no-vip-27 created
pod/no-vip-28 created
pod/no-vip-29 created

If we now check the location of the PODs, we can see that some of the PODs are scheduled on the master as well. In the example below, the scheduler has chosen to schedule the 12th POD on the master:

kubectl get pods -o wide

# output:
NAME        READY   STATUS    RESTARTS   AGE     IP           NODE     NOMINATED NODE   READINESS GATES
no-vip-01   1/1     Running   0          2m56s   10.40.0.2    node01   <none>           <none>
no-vip-02   1/1     Running   0          2m56s   10.40.0.3    node01   <none>           <none>
no-vip-03   1/1     Running   0          2m56s   10.40.0.1    node01   <none>           <none>
no-vip-04   1/1     Running   0          2m56s   10.40.0.4    node01   <none>           <none>
no-vip-05   1/1     Running   0          65s     10.40.0.5    node01   <none>           <none>
no-vip-06   1/1     Running   0          65s     10.40.0.6    node01   <none>           <none>
no-vip-07   1/1     Running   0          64s     10.40.0.7    node01   <none>           <none>
no-vip-08   1/1     Running   0          64s     10.40.0.8    node01   <none>           <none>
no-vip-09   1/1     Running   0          64s     10.40.0.9    node01   <none>           <none>
no-vip-10   1/1     Running   0          64s     10.40.0.10   node01   <none>           <none>
no-vip-11   1/1     Running   0          64s     10.40.0.11   node01   <none>           <none>
no-vip-12   1/1     Running   0          63s     10.32.0.2    master   <none>           <none>
no-vip-13   1/1     Running   0          63s     10.40.0.12   node01   <none>           <none>
no-vip-14   1/1     Running   0          63s     10.40.0.13   node01   <none>           <none>
no-vip-15   1/1     Running   0          63s     10.40.0.14   node01   <none>           <none>
no-vip-16   1/1     Running   0          62s     10.32.0.3    master   <none>           <none>
no-vip-17   1/1     Running   0          62s     10.40.0.15   node01   <none>           <none>
no-vip-18   1/1     Running   0          62s     10.32.0.4    master   <none>           <none>
no-vip-19   1/1     Running   0          62s     10.40.0.16   node01   <none>           <none>
no-vip-20   1/1     Running   0          61s     10.40.0.17   node01   <none>           <none>
no-vip-21   1/1     Running   0          61s     10.40.0.18   node01   <none>           <none>
no-vip-22   1/1     Running   0          61s     10.32.0.5    master   <none>           <none>
no-vip-23   1/1     Running   0          60s     10.40.0.19   node01   <none>           <none>
no-vip-24   1/1     Running   0          60s     10.32.0.6    master   <none>           <none>
no-vip-25   1/1     Running   0          60s     10.40.0.20   node01   <none>           <none>
no-vip-26   1/1     Running   0          60s     10.32.0.7    master   <none>           <none>
no-vip-27   1/1     Running   0          59s     10.32.0.8    master   <none>           <none>
no-vip-28   1/1     Running   0          59s     10.40.0.21   node01   <none>           <none>
no-vip-29   1/1     Running   0          58s     10.40.0.22   node01   <none>           <none>
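
Counting the output lines by hand is tedious. A small sketch that counts the PODs per node by extracting spec.nodeName with jsonpath (the exact counts will differ from run to run):

kubectl get pods -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c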

Summary of Task 2

In this task, we have shown that PODs are scheduled on both nodes if no nodeSelector is defined for the PODs and no taint restricts their scheduling.
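
Optionally, we can clean up the no-vip PODs before moving on to Task 3. Since we did not attach labels to the PODs, a label selector will not help here; a sketch that filters by name instead:

kubectl get pods -o name | grep no-vip | xargs kubectl delete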

Task 3: Run PODs with a non-matching Node Selector

We now want to explore what happens if we try to start PODs with a nodeSelector that does not match any node.

Step 3.0: Check Nodes and un-taint the Master

We have done this already in Tasks 1 and 2, but if you happen to run the current task independently of the other tasks, you need to repeat the following commands:

# optional
source <(kubectl completion bash)
source <(kubectl completion bash | sed 's/kubectl/k/g')
kubectl get nodes

# required
kubectl taint node master node-role.kubernetes.io/master-

Step 3.1: Run PODs with a not (yet) matching Node Selector

Now let us create four PODs with the nodeSelector vip: "maybe" in the POD spec; this value does not (yet) match any node label:

cat <<EOF | kubectl create -f -
---
apiVersion: v1
kind: Pod
metadata:
  name: vip1
spec:
  containers:
  - name: sleep
    image: busybox
    args:
    - sleep
    - "1000000"
  nodeSelector:
    vip: "maybe"
---
apiVersion: v1
kind: Pod
metadata:
  name: vip2
spec:
  containers:
  - name: sleep
    image: busybox
    args:
    - sleep
    - "1000000"
  nodeSelector:
    vip: "maybe"
---
apiVersion: v1
kind: Pod
metadata:
  name: vip3
spec:
  containers:
  - name: sleep
    image: busybox
    args:
    - sleep
    - "1000000"
  nodeSelector:
    vip: "maybe"
---
apiVersion: v1
kind: Pod
metadata:
  name: vip4
spec:
  containers:
  - name: sleep
    image: busybox
    args:
    - sleep
    - "1000000"
  nodeSelector:
    vip: "maybe"
EOF

# output: 
pod/vip1 created
pod/vip2 created
pod/vip3 created
pod/vip4 created

We now check the location of the PODs with the kubectl get pod command and the -o wide option:

kubectl get pod -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
vip1   0/1     Pending   0          84s   <none>   <none>   <none>           <none>
vip2   0/1     Pending   0          84s   <none>   <none>   <none>           <none>
vip3   0/1     Pending   0          84s   <none>   <none>   <none>           <none>
vip4   0/1     Pending   0          84s   <none>   <none>   <none>           <none>

We can see that the PODs are in the Pending status, since the scheduler has not yet found any matching node. Let us look more closely:

kubectl describe pod vip1

# output:
...<omitted lines>...
Node-Selectors:  vip=maybe
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  79s (x4 over 3m55s)  default-scheduler  0/2 nodes are available: 2 node(s) didn't match node selector.

At the end of the output, you can see a message from the default-scheduler stating that neither of the two nodes matched the node selector.
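
As a side note, Pending PODs can also be listed directly with a field selector on status.phase, which is useful on busier clusters. A quick sketch:

kubectl get pods --field-selector status.phase=Pending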

Step 3.2: Label the Master to match the Node Selector of the PODs

Now let us observe what happens if a node with a matching label appears while the PODs are in the Pending status:

kubectl label nodes master vip=maybe --overwrite
node/master labeled

If you are quick enough, you might see that the PODs are scheduled on the master immediately: they enter the ContainerCreating status right away:

kubectl get pod -o wide
NAME   READY   STATUS              RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
vip1   0/1     ContainerCreating   0          10m   <none>   master   <none>           <none>
vip2   0/1     ContainerCreating   0          10m   <none>   master   <none>           <none>
vip3   0/1     ContainerCreating   0          10m   <none>   master   <none>           <none>
vip4   0/1     ContainerCreating   0          10m   <none>   master   <none>           <none>
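
If you do not want to race the scheduler, you can also follow the status transitions with the watch option of kubectl get (press Ctrl-C to stop watching):

kubectl get pod -o wide -w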

After a few seconds, the PODs enter the Running status:

kubectl get pod -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP          NODE     NOMINATED NODE   READINESS GATES
vip1   1/1     Running   0          15m   10.32.0.2   master   <none>           <none>
vip2   1/1     Running   0          15m   10.32.0.5   master   <none>           <none>
vip3   1/1     Running   0          15m   10.32.0.3   master   <none>           <none>
vip4   1/1     Running   0          15m   10.32.0.4   master   <none>           <none>

Step 3.3: Remove the Label from the Master

Let us now explore what happens to the PODs if the matching label is removed from the master again:

kubectl label nodes master vip-
node/master labeled

kubectl get pod -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP          NODE     NOMINATED NODE   READINESS GATES
vip1   1/1     Running   0          19m   10.32.0.2   master   <none>           <none>
vip2   1/1     Running   0          19m   10.32.0.5   master   <none>           <none>
vip3   1/1     Running   0          19m   10.32.0.3   master   <none>           <none>
vip4   1/1     Running   0          19m   10.32.0.4   master   <none>           <none>

We can see that the nodeSelector is applied during scheduling only: running PODs are kept running on the node, even if removing a label from the node means that the POD's nodeSelector no longer matches the node.
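
We can make this mismatch explicit by comparing the node's labels with the POD's nodeSelector. A quick sketch using jsonpath; the vip label should be gone from the master, while the POD spec still requests vip=maybe:

kubectl get node master --show-labels | grep vip || echo "no vip label on master"
kubectl get pod vip1 -o jsonpath='{.spec.nodeSelector}{"\n"}'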

The PODs are kept, even if some event crashes the Docker containers. Let us demonstrate that for POD vip1 by killing all of its Docker containers (the grep matches both the sleep container and the POD's pause infrastructure container):

docker ps | grep vip1 | awk '{print $1}' | xargs docker kill
e42b2785ca6f
00c7c48ea1f9

The POD is restarted successfully on the master node, even though the nodeSelector does not match:

kubectl get pod -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP          NODE     NOMINATED NODE   READINESS GATES
vip1   1/1     Running   2          35m   10.32.0.5   master   <none>           <none>
vip2   1/1     Running   1          35m   10.32.0.2   master   <none>           <none>
vip3   1/1     Running   1          35m   10.32.0.3   master   <none>           <none>
vip4   1/1     Running   1          35m   10.32.0.4   master   <none>           <none>
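
The RESTARTS column confirms the restart. If you prefer a targeted check, the restart count of vip1's container can also be read via jsonpath; a quick sketch:

kubectl get pod vip1 -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'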

Only if the POD is deleted and re-created will the scheduler fail to find a matching node:

kubectl delete pod vip1

We now re-create the POD:

cat <<EOF | kubectl create -f -
---
apiVersion: v1
kind: Pod
metadata:
  name: vip1
spec:
  containers:
  - name: sleep
    image: busybox
    args:
    - sleep
    - "1000000"
  nodeSelector:
    vip: "maybe"

EOF

# output: 
pod/vip1 created

In this case, the POD is kept Pending, as expected:

kubectl get pod -o wide
NAME   READY   STATUS    RESTARTS   AGE     IP          NODE     NOMINATED NODE   READINESS GATES
vip1   0/1     Pending   0          13s     <none>      <none>   <none>           <none>
vip2   1/1     Running   0          4m46s   10.40.0.4   master   <none>           <none>
vip3   1/1     Running   0          4m46s   10.40.0.3   master   <none>           <none>
vip4   1/1     Running   0          4m46s   10.40.0.1   master   <none>           <none>

Summary of Task 3

In Task 3, we have seen that PODs are kept running on a node even if a label is removed from the node and the nodeSelector does not match anymore. Even if a container of the POD crashes or is killed, the container is restarted and the POD keeps running on the node. Only if the POD is deleted and re-created will it enter the Pending status.

Summary

In this blog post, we have shown how to control on which node a POD is running. For that, we have created PODs with a nodeSelector that matches only a single node.

We have shown that a POD with a nodeSelector that does not match any node is kept Pending. As soon as a node shows up that has a matching label, the PODs are scheduled to run on that node.

We have also shown what happens to PODs if labels are removed from a node. If the POD's nodeSelector does not match the node's labels anymore, the POD is kept running on the node, even if a container of the POD crashes.
