This lab concentrates on Kubernetes Resource Management. We will explore resource limits for containers, applied at the Deployment level as well as at the Namespace level. We will also discuss how to limit the sum of resources at the Namespace level using ResourceQuotas.

The tests are again performed on the Katacoda Kubernetes Playground. Even though I had not expected it, the Katacoda platform offers enough resources, and enough ways to limit them, for these experiments.

Phase 1: Container Resource Limits of Deployments

Step 1.1: Deploy a Stressful POD

On the master, we first create a YAML file with the --dry-run option shown in the previous blog post, before we apply the file and check the result:

kubectl create deployment stress --image vish/stress --dry-run -o yaml > stress.yaml
kubectl apply -f stress.yaml

# output: deployment.apps/stress created

kubectl get deployments

# output:
NAME     READY   UP-TO-DATE   AVAILABLE   AGE
stress   1/1     1            1           75s

Step 1.2: Set Resource Limits for the POD

Now we change the YAML file and replace the line

        resources: {}

with

        resources:
          limits:
            memory: "500Mi"
          requests:
            memory: "250Mi"

The YAML file now looks as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: stress
  name: stress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: stress
    spec:
      containers:
      - image: vish/stress
        name: stress
        resources:
          limits:
            memory: "500Mi"
          requests:
            memory: "250Mi"
        terminationMessagePolicy: FallbackToLogsOnError
status: {}

Note: the memory value must not contain a blank between the number and the unit.

At first, I got the error message Limits: unmarshalerDecoder: quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$', error found in #10 byte of ..., because I had tried to set the memory limit to "500 Mi" instead of "500Mi". Removing the space fixed the problem.
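
For reference, Kubernetes accepts quantities in several notations, as long as there is no blank between the number and the suffix. A few memory examples for illustration (the values themselves are arbitrary):

        memory: "500Mi"      # 500 mebibytes (binary suffix)
        memory: "500M"       # 500 megabytes (decimal suffix)
        memory: "0.5Gi"      # fractional values are allowed
        memory: "524288000"  # plain bytes, no suffix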

Contrary to what you might think, starting the stress container does not stress the system yet:

kubectl logs stress-6f8b598b78-8s94p

# output:
I0719 15:42:52.484422       1 main.go:26] Allocating "0" memory, in "4Ki" chunks, with a 1ms sleep between allocations
I0719 15:42:52.484525       1 main.go:29] Allocated "0" memory

Step 1.3: Allocate POD Resources below Limit

To stress the system, we need to run the stress container with some additional arguments:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: stress
  name: stress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: stress
    spec:
      containers:
      - image: vish/stress
        name: stress
        resources:
          limits:
            cpu: "1"
            memory: "500Mi"
          requests:
            cpu: "0.5"
            memory: "250Mi"
        args:
        - -cpus
        - "2"
        - -mem-total
        - "400Mi"
        - -mem-alloc-size
        - "100Mi"
        - -mem-alloc-sleep
        - "1s"
        terminationMessagePolicy: FallbackToLogsOnError

Now we can see on the worker node that the stress process is consuming almost 100% of one CPU and about 400Mi of memory:

node01 $ top

top - 16:10:32 up 48 min,  1 user,  load average: 0.97, 0.56, 0.27
Tasks: 134 total,   1 running, 133 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.5 us, 20.3 sy,  0.0 ni, 74.1 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem :  4045932 total,  2562724 free,   688828 used,   794380 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  3109212 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6727 root      20   0  431604 426396   3184 S 100.0 10.5   2:41.12 stress
 1512 root      20   0  852216  92716  60100 S   2.3  2.3   1:32.58 kubelet
 1002 root      20   0  731160  92372  39316 S   1.0  2.3   0:49.77 dockerd
    7 root      20   0       0      0      0 S   0.3  0.0   0:01.05 rcu_sched
 6868 root      20   0       0      0      0 S   0.3  0.0   0:00.23 kworker/u8:3
    1 root      20   0   38080   6124   4008 S   0.0  0.2   0:04.48 systemd
...
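
As an alternative to running top on the node, the resource consumption can also be queried from the master via the metrics API, provided the metrics-server add-on is installed (which may not be the case on the Katacoda playground); this is just an optional cross-check, not part of the lab:

kubectl top pod
# prints CPU(cores) and MEMORY(bytes) per POD; the values should roughly match the top output above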

Step 1.4: Allocate POD Resources above Limit

Now let us see what happens if stress tries to allocate more memory than allowed: we set the memory consumption of the stress container above the limit of 500Mi:

    spec:
      containers:
      - image: vish/stress
        name: stress
        resources:
          limits:
            cpu: "1"
            memory: "500Mi"
          requests:
            cpu: "0.5"
            memory: "250Mi"
        args:
        - -cpus
        - "2"
        - -mem-total
        - "600Mi"
        - -mem-alloc-size
        - "100Mi"
        - -mem-alloc-sleep
        - "1s"
        terminationMessagePolicy: FallbackToLogsOnError

We re-apply the deployment on the master

kubectl replace -f stress.yaml

and watch what happens on the worker node:

node01 $ watch "kubectl get pods"

We will see something like the following:

master $ k get pods
NAME                      READY   STATUS    RESTARTS   AGE
stress-7bd7c8c65d-5xkhs   1/1     Running   1          15s
master $ k get pods
NAME                      READY   STATUS      RESTARTS   AGE
stress-7bd7c8c65d-5xkhs   0/1     OOMKilled   1          18s
master $ k get pods
NAME                      READY   STATUS      RESTARTS   AGE
stress-7bd7c8c65d-5xkhs   0/1     OOMKilled   1          19s
master $ k get pods
NAME                      READY   STATUS      RESTARTS   AGE
stress-7bd7c8c65d-5xkhs   0/1     OOMKilled   1          20s
master $ k get pods
NAME                      READY   STATUS      RESTARTS   AGE
stress-7bd7c8c65d-5xkhs   0/1     OOMKilled   1          21s
master $ k get pods
NAME                      READY   STATUS      RESTARTS   AGE
stress-7bd7c8c65d-5xkhs   0/1     OOMKilled   1          23s
master $ k get pods
NAME                      READY   STATUS      RESTARTS   AGE
stress-7bd7c8c65d-5xkhs   0/1     OOMKilled   1          24s
master $ k get pods
NAME                      READY   STATUS      RESTARTS   AGE
stress-7bd7c8c65d-5xkhs   0/1     OOMKilled   1          25s
master $ k get pods
NAME                      READY   STATUS      RESTARTS   AGE
stress-7bd7c8c65d-5xkhs   0/1     OOMKilled   1          27s
master $ k get pods
NAME                      READY   STATUS             RESTARTS   AGE
stress-7bd7c8c65d-5xkhs   0/1     CrashLoopBackOff   1          29s
master $ k get pods
NAME                      READY   STATUS    RESTARTS   AGE
stress-7bd7c8c65d-5xkhs   1/1     Running   2          30s

We can see that the POD cycles from the Running status to OOMKilled, then to CrashLoopBackOff, and finally back to Running.

So we can see that Kubernetes simply kills any container that exceeds its memory limit. This is not friendly, but it is effective.
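
The reason for the restarts can also be confirmed from the master: kubectl describe shows the last termination state of the container, which should report OOMKilled with exit code 137 (the POD name is the one from the output above; the exact formatting of the output depends on the kubectl version):

kubectl describe pod stress-7bd7c8c65d-5xkhs | grep -A 7 "Last State"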

Let us see what happens if only the CPU limit is exceeded.

We had specified a CPU limit of 1, but stress tried to allocate 2 CPUs. Why wasn't the POD killed before we increased the memory consumption of the stress process? Could the reason be that the node has only one CPU? Let us check:

node01 $ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
stepping        : 1
microcode       : 0x1
cpu MHz         : 2099.996
cache size      : 16384 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ibrs ibpb kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds
bogomips        : 4199.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
stepping        : 1
microcode       : 0x1
cpu MHz         : 2099.996
cache size      : 16384 KB
physical id     : 1
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ibrs ibpb kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds
bogomips        : 4199.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
stepping        : 1
microcode       : 0x1
cpu MHz         : 2099.996
cache size      : 16384 KB
physical id     : 2
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ibrs ibpb kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds
bogomips        : 4199.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
stepping        : 1
microcode       : 0x1
cpu MHz         : 2099.996
cache size      : 16384 KB
physical id     : 3
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ibrs ibpb kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds
bogomips        : 4199.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Step 1.5: Exceeding the CPU Limit

The node reports four processors, so a lack of CPUs cannot be the explanation; this step reveals what really happens. Let us use less memory, but limit the CPU to 0.4, so the stress process gets the chance to exceed the CPU limit:

# stress.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: stress
  name: stress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: stress
    spec:
      containers:
      - image: vish/stress
        name: stress
        resources:
          limits:
            cpu: "0.4"
            memory: "500Mi"
          requests:
            cpu: "0.1"
            memory: "250Mi"
        args:
        - -cpus
        - "1"
        - -mem-total
        - "100Mi"
        - -mem-alloc-size
        - "100Mi"
        - -mem-alloc-sleep
        - "1s"
        terminationMessagePolicy: FallbackToLogsOnError
status: {}

I would have expected the POD to be killed again, but this time it was not:

k get pods
NAME                     READY   STATUS    RESTARTS   AGE
stress-d5bf8ff87-pvvb4   1/1     Running   0          2m13s

The behavior is much better than that: the CPU limit is never actually exceeded, because Kubernetes has the Linux kernel's scheduler throttle the container, so the process can never consume more CPU than the limit:

node01 $ top

top - 16:42:48 up  1:20,  1 user,  load average: 0.54, 0.51, 0.39
Tasks: 132 total,   2 running, 130 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.0 us,  8.1 sy,  0.0 ni, 88.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  4045932 total,  2875716 free,   370532 used,   799684 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  3425560 avail Mem

PID USER        PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10426 root      20   0  113448 109280   3120 R  39.9  2.7   1:12.46 stress
 1512 root      20   0  852216  94204  60280 S   2.3  2.3   2:37.44 kubelet
 1002 root      20   0  731672  90128  39316 S   1.0  2.2   1:24.73 dockerd
...

That is nice! Instead of the POD being killed, it is simply throttled.
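
Under the hood, this is plain CFS bandwidth control: the kubelet writes the CPU limit into the container's cgroup as a quota per scheduling period. The exact cgroup path depends on the Kubernetes version, the cgroup driver and whether cgroup v1 or v2 is used, so the following is only a sketch for cgroup v1, with placeholders for the path components:

# on the worker node; <pod-uid> and <container-id> are placeholders
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<pod-uid>/<container-id>/cpu.cfs_period_us
# expected: 100000 (the default period of 100 ms)
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<pod-uid>/<container-id>/cpu.cfs_quota_us
# expected: 40000, i.e. 40000/100000 = 0.4 CPU, matching the limit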

However, if a single POD cannot consume more than 40% of a CPU, can we just scale the application horizontally to circumvent the limitation? Let us try and scale the Deployment:

Step 1.6: Increasing the number of PODs

k scale deployment stress --replicas=3

On the worker node, we see that each POD is limited to 40% of a CPU, but by scaling horizontally, the Deployment as a whole can consume a multiple of that:

node01 $ top

top - 17:08:05 up 55 min,  1 user,  load average: 2.68, 2.60, 1.27
Tasks: 139 total,   1 running, 138 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.1 us, 23.7 sy,  0.0 ni, 69.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.1 st
KiB Mem :  4045932 total,  2681328 free,   592728 used,   771876 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  3197920 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6617 root      20   0  113192 109544   3120 S  39.9  2.7   2:29.96 stress
 7104 root      20   0  113192 109664   3248 S  39.9  2.7   2:07.84 stress
 6884 root      20   0  113448 109284   3120 S  39.5  2.7   2:15.97 stress
 1548 root      20   0 1081608  95244  61412 S   2.7  2.4   1:42.81 kubelet
  994 root      20   0 1033228  94284  39756 S   1.0  2.3   0:56.09 dockerd
 9483 root      20   0       0      0      0 S   0.3  0.0   0:00.17 kworker/u8

Okay, the total amount of CPU is not limited for the Deployment as a whole. Can we simply move the resource limits from the container spec up to the Deployment spec? Let us try:

# stress-deployment-limit.yaml -- failed
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: stress
  name: stress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress
  strategy: {}
  resources:
    limits:
      cpu: "0.4"
      memory: "500Mi"
    requests:
      cpu: "0.1"
      memory: "250Mi"  template:
    metadata:
      creationTimestamp: null
      labels:
        app: stress
    spec:
      containers:
      - image: vish/stress
        name: stress
        resources: {}
        args:
        - -cpus
        - "1"
        - -mem-total
        - "100Mi"
        - -mem-alloc-size
        - "100Mi"
        - -mem-alloc-sleep
        - "1s"
        terminationMessagePolicy: FallbackToLogsOnError
status: {}

No, we cannot; resource limits are not supported at the Deployment level:

k replace -f stress-deployment-limit.yaml
error: error validating "stress-deployment-limit.yaml": error validating data: ValidationError(Deployment.spec): unknown field "resources" in io.k8s.api.apps.v1.DeploymentSpec; if you choose to ignore these errors, turn validation off with --validate=false
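
To check which fields are valid at which level, kubectl explain is helpful: the Deployment spec itself offers no resources field, while the container spec inside the POD template does:

kubectl explain deployment.spec
kubectl explain deployment.spec.template.spec.containers.resources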

Phase 2: Namespace Level Policies

Step 2.1: Limit Ranges

Above, we have set resource requests and limits per container by adding a resources section to the container spec within the Deployment. Those limits apply to every container created from that Deployment's template. Can we do something similar for Namespaces?

Yes, we can. We need to create an object of type LimitRange in the namespace in question. Let us create the LimitRange YAML file:

# limitrange.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: low-resource-range
spec:
  limits:
  - type: Container
    default:
      cpu: 0.2
    defaultRequest:
      cpu: 0.1

Now let us create a namespace and apply the limit range to it. For that, we just need to create the LimitRange object within the corresponding namespace:

k create namespace low-resource-range
# output: namespace/low-resource-range created

k create -f limitrange.yaml --namespace low-resource-range
# output: limitrange/low-resource-range created
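
As a quick sanity check (not part of the original lab), the defaults of the LimitRange can be inspected before any POD is created:

kubectl describe limitrange low-resource-range --namespace low-resource-range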

Now we create a Deployment and verify that its POD has inherited the default limits from the namespace's LimitRange:

k create deployment nginx --image=nginx --namespace low-resource-range

We can see that the pod is up and running:

k get pods --namespace low-resource-range
NAME                     READY   STATUS    RESTARTS   AGE
nginx-65f88748fd-g88wg   1/1     Running   0          4m33s

Moreover, we can see that the POD has picked up the defaults from the LimitRange:

kubectl describe pod nginx-65f88748fd-g88wg -n low-resource-range
...
    Limits:
      cpu:  200m
    Requests:
      cpu:  100m
...

Interestingly, nothing of that sort can be seen at the Deployment level: kubectl describe deployment nginx --namespace low-resource-range does not show any hint of the limit range. This is purely an interaction between the namespace and the POD. That may even be the better design, since a changed LimitRange is applied whenever a new POD comes up.
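
Note that a LimitRange can do more than set defaults: it can also enforce minimum and maximum values per container. A sketch of such a LimitRange, not tested in this lab, could look like this:

# limitrange-minmax.yaml -- sketch only, not part of the lab
apiVersion: v1
kind: LimitRange
metadata:
  name: bounded-resource-range
spec:
  limits:
  - type: Container
    max:
      cpu: "1"
      memory: 1Gi
    min:
      cpu: 50m
      memory: 64Mi
    default:
      cpu: 200m
      memory: 256Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi

With such a LimitRange in place, any container that requests less than the min or more than the max values is rejected at creation time.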

Step 2.2: Resource Quotas

We can apply ResourceQuotas to a namespace as well. But what is the difference between a LimitRange and a ResourceQuota?

Earlier, we tried to apply limits to the sum of resources of a Deployment, which is not supported. However, we can limit the overall resources within a namespace through ResourceQuotas: while a LimitRange constrains (and defaults) the resources of each individual container, a ResourceQuota caps the total resources consumed by all objects in the namespace.

This was not part of the LFS458 lab, but a look at the official documentation shows that ResourceQuotas are handled very similarly to LimitRanges. I have copied the example from there:

kubectl create namespace myspace
cat <<EOF > compute-resources.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "4"
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
    requests.nvidia.com/gpu: 4
EOF
kubectl create -f ./compute-resources.yaml --namespace=myspace

I have not tested it yet, but the user should now receive a 403 Forbidden message when trying to create a resource that would exceed the quota.
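
The current consumption against the quota can be inspected at any time; again untested here, but the command is standard kubectl:

kubectl describe resourcequota compute-resources --namespace=myspace
# lists the hard limits next to the currently used amounts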

Next:

Node Maintenance
