Migrating a kubeadm-based Kubernetes cluster from Docker to containerd should be easy. In my case, it wasn’t. Wrong information found on the Internet, combined with an incompatibility between the Kubernetes and containerd versions I used, caused major problems. Fortunately, I had created full snapshots of my virtual master and nodes, so I was able to start from scratch several times. This step-by-step guide describes how it finally worked for me.

Overview

Recommended workflow:

  • Start with the master(s), then work through this guide node by node
  • Upgrade Kubernetes, if needed (we upgraded from v1.21.3 to v1.22.9)
  • Create a full backup or snapshot before the Kubernetes upgrade and before the migration
  • First, move from Docker to containerd, reboot and check the system (steps 2.x)
  • Then, optionally, update to the latest supported containerd version (steps 3.x). Note: for Kubernetes 1.22.x, containerd 1.5.10 is supported, but 1.6.x does not work. I first tried to do everything in one step, which led to long troubleshooting sessions.
  • See the appendix below for some error messages you might encounter, together with proposed solutions.

Step 1 (optional): Upgrade Kubernetes

The following steps were tested with Kubernetes v1.22.9. Kubernetes version 1.24 is already available, but since it removes dockershim, I did not dare to upgrade the cluster and make my Docker-based systems unusable before migrating. If you have a system that is older than v1.22.9, we recommend performing an upgrade first.

Step 2: Migrate from Docker to containerd

Better create another backup or snapshot here…

Step 2.1: Stop all Containers

kubectl drain $(hostname) --ignore-daemonsets --delete-emptydir-data
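To confirm that the drain really emptied the node, it can help to list the pods that are still scheduled on it. The following is a minimal sketch, assuming kubectl is configured for the current user; pods_left_on_node is a hypothetical helper name, not part of kubectl. DaemonSet pods (for example kube-proxy and weave-net) and static control-plane pods are expected to remain.

```shell
# Hypothetical helper: list pods still scheduled on a node after the drain.
# DaemonSet-managed pods and static control-plane pods are expected to remain.
pods_left_on_node() {
  node="${1:-$(hostname)}"
  kubectl get pods --all-namespaces -o wide \
    --field-selector "spec.nodeName=${node}" --no-headers
}

# Usage (on the drained node):
#   pods_left_on_node
```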

Step 2.2: Configure containerd

Reset the containerd configuration to its defaults (make a backup of /etc/containerd/config.toml first, if needed).

containerd config default | sudo tee /etc/containerd/config.toml
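If a backup of the old configuration is wanted, a small helper like the one below could be run before the command above. This is only a sketch: backup_file is a hypothetical name, and the timestamp suffix is my own choice.

```shell
# Hypothetical helper: copy a file to <file>.bak-<timestamp> if it exists
# and print the name of the backup. Does nothing if the file is missing.
backup_file() {
  src="$1"
  [ -f "$src" ] || return 0          # nothing to back up
  dst="${src}.bak-$(date +%Y-%m-%dT%H-%M-%S)"
  cp -p "$src" "$dst" && echo "$dst"
}

# Usage (as root, before regenerating the default config):
#   backup_file /etc/containerd/config.toml
```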

Step 2.3: Prepare the System

Prerequisites as defined in the official documentation:

cat <<EOF | sudo tee /etc/modules-load.d/containerd.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

# Set up required sysctl params; these persist across reboots.
cat <<EOF | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-iptables  = 1
net.ipv4.ip_forward                 = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF

# Apply sysctl params without reboot
sudo sysctl --system
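To verify that the kernel parameters really took effect, the live values can be read back with sysctl -n. The following sketch assumes the three keys from the file above; check_cri_sysctls is a hypothetical helper name.

```shell
# Hypothetical helper: verify that the three kernel parameters are set to 1.
# Prints a warning for each deviating key and returns non-zero in that case.
check_cri_sysctls() {
  rc=0
  for key in net.bridge.bridge-nf-call-iptables \
             net.ipv4.ip_forward \
             net.bridge.bridge-nf-call-ip6tables; do
    val=$(sysctl -n "$key" 2>/dev/null)
    if [ "$val" != "1" ]; then
      echo "WARNING: $key is '${val}', expected 1"
      rc=1
    fi
  done
  return $rc
}

# Usage:
#   check_cri_sysctls && echo "sysctl settings look fine"
```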

Step 2.4: Restart containerd

Now all preparations are done to restart containerd:

sudo systemctl enable containerd
sudo systemctl restart containerd
sleep 10
sudo systemctl status containerd

Step 2.5: Reconfigure kubelet

Now let us reconfigure Kubernetes to use containerd as its runtime:

sudo grep containerd /var/lib/kubelet/kubeadm-flags.env || sudo sed -i 's|KUBELET_KUBEADM_ARGS="|KUBELET_KUBEADM_ARGS="--container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock |' /var/lib/kubelet/kubeadm-flags.env
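The same idempotent edit can be wrapped in a function that takes the file as an argument, which makes it easy to try on a copy first. This is a sketch only: add_containerd_flags is a hypothetical name, and GNU sed (as on CentOS) is assumed for the -i option.

```shell
# Hypothetical helper: add the containerd flags to a kubeadm-flags.env file
# unless a line mentioning containerd is already present.
# Assumes GNU sed for in-place editing (-i).
add_containerd_flags() {
  env_file="${1:-/var/lib/kubelet/kubeadm-flags.env}"
  if grep -q containerd "$env_file"; then
    echo "already configured: $env_file"
    return 0
  fi
  sed -i 's|KUBELET_KUBEADM_ARGS="|KUBELET_KUBEADM_ARGS="--container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock |' "$env_file"
}

# Usage: try it on a copy, inspect the result, then apply it as root.
#   cp /var/lib/kubelet/kubeadm-flags.env /tmp/flags.env
#   add_containerd_flags /tmp/flags.env && cat /tmp/flags.env
```

Running the function a second time leaves the file unchanged, because the grep check matches the flags added by the first run.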

Here, we need to perform a manual step:

kubectl edit node $(hostname)

In the editor that opens, change the value of kubeadm.alpha.kubernetes.io/cri-socket from /var/run/dockershim.sock to the containerd CRI socket path unix:///run/containerd/containerd.sock.
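If you prefer to avoid the interactive editor, the annotation can probably be set with kubectl annotate instead. This is a sketch I have not run on the cluster described here; set_cri_socket_annotation is a hypothetical helper name, while the annotation key and socket path are the ones given above.

```shell
# Hypothetical helper: set the CRI socket annotation without opening an editor.
# --overwrite is needed because the annotation already exists on the node.
set_cri_socket_annotation() {
  node="${1:-$(hostname)}"
  kubectl annotate node "$node" --overwrite \
    kubeadm.alpha.kubernetes.io/cri-socket=unix:///run/containerd/containerd.sock
}

# Usage (on the node being migrated):
#   set_cri_socket_annotation
```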

Now we can restart the kubelet:

sudo systemctl restart kubelet
sleep 10
sudo systemctl status kubelet

Step 2.6: Check Kubernetes Status

Now we should see the new container runtime when running the following command:

kubectl get nodes -o wide

We should see an output similar to the following:

NAME          STATUS   ROLES                  AGE    VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
dev-master1   Ready    control-plane,master   369d   v1.22.9   116.203.100.241           CentOS Linux 7 (Core)   3.10.0-1160.25.1.el7.x86_64   containerd://1.2.13
dev-node1     Ready                           369d   v1.22.9   116.203.107.195           CentOS Linux 7 (Core)   3.10.0-1160.25.1.el7.x86_64   containerd://1.5.10
dev-node2     Ready                           295d   v1.22.9   116.203.74.214            CentOS Linux 7 (Core)   3.10.0-1160.31.1.el7.x86_64   containerd://1.5.10
...

This might show a mixed state, with some nodes running Docker and others running containerd.
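While working through the cluster node by node, it is handy to list only the nodes that still report Docker as their runtime. A minimal sketch, assuming kubectl access; nodes_on_docker is a hypothetical helper name.

```shell
# Hypothetical helper: print the names of nodes whose runtime is still Docker.
# Reads the runtime from .status.nodeInfo.containerRuntimeVersion.
nodes_on_docker() {
  kubectl get nodes --no-headers \
    -o custom-columns=NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion \
    | awk '$2 ~ /^docker:\/\// { print $1 }'
}

# Usage:
#   nodes_on_docker    # prints nothing once every node has been migrated
```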

Step 2.7: Enable Scheduling of Containers

Now that the node is migrated, we can schedule containers on it again:

kubectl uncordon $(hostname)

Step 2.8: Check the System

Now the containers are started again on the node. We can check this with the following command:

kubectl get pod --all-namespaces -o wide

Here you might see some important pods in CrashLoopBackOff status. In my case, on my PROD cluster, etcd-master1, kube-apiserver-master1 and kube-controller-manager-master1 were in this state. The events looked as follows:

kubectl -n kube-system describe pod etcd-master1 | grep -A 100 Events

The output was:

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 10m (x2 over 10m) kubelet Liveness probe failed: Get "http://127.0.0.1:2381/health": dial tcp 127.0.0.1:2381: connect: connection refused
Normal SandboxChanged 10m kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 10m kubelet Container image "k8s.gcr.io/etcd:3.5.0-0" already present on machine
Normal Created 10m kubelet Created container etcd
Normal Started 10m kubelet Started container etcd
Normal Pulling 6m17s kubelet Pulling image "k8s.gcr.io/etcd:3.5.0-0"
Normal Pulled 5m58s kubelet Successfully pulled image "k8s.gcr.io/etcd:3.5.0-0" in 18.49950886s
Normal Created 5m23s (x4 over 5m58s) kubelet Created container etcd
Normal Started 5m23s (x4 over 5m58s) kubelet Started container etcd
Normal Pulled 5m23s (x3 over 5m58s) kubelet Container image "k8s.gcr.io/etcd:3.5.0-0" already present on machine
Warning BackOff 65s (x36 over 5m57s) kubelet Back-off restarting failed container

However, this was fixed by a reboot (I hope this is the case for you as well).
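To spot such pods quickly across all namespaces, the STATUS column of the pod list can be filtered. The following is a sketch that assumes the default column layout of kubectl get pod (NAMESPACE, NAME, READY, STATUS, ...); pods_not_running is a hypothetical helper name.

```shell
# Hypothetical helper: list pods whose STATUS is neither Running nor Completed.
# Relies on STATUS being the 4th column of "kubectl get pod --all-namespaces".
pods_not_running() {
  kubectl get pod --all-namespaces --no-headers \
    | awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}

# Usage:
#   pods_not_running
```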

Step 2.9 (optional): Disable Docker

We still need Docker for the CI’s docker build processes on my DEV cluster. This is why I have not disabled Docker there. However, Docker is no longer needed in production, so I have disabled it on my PROD cluster:

sudo systemctl disable docker --now
sudo systemctl stop docker

Check that Docker is stopped. If it is not, the following reboot might help. If that still does not help, you can also remove Docker from the system.

sudo yum remove -y docker

Step 2.10: Check again after Reboot

In my case, there were still Docker containers running. I did not disable Docker, since our CI still needs to run docker build commands. In any case, after such a big change, we should perform a reboot:

sudo reboot

Wait 2 minutes, reconnect to the system and check the pods in the kube-system namespace again:

kubectl -n kube-system get pod -o wide

This time, the output looked as follows on my PROD system:

$ kubectl -n kube-system get pod -o wide
NAME                              READY   STATUS    RESTARTS         AGE     IP               NODE      NOMINATED NODE   READINESS GATES
coredns-78fcd69978-8qj4j          1/1     Running   0                3d20h   10.36.0.7        node2                
coredns-78fcd69978-xs2tl          1/1     Running   3 (4m4s ago)     3d20h   10.44.0.2        master1              
etcd-master1                      1/1     Running   9 (9m18s ago)    3d20h   78.47.138.201    master1              
kube-apiserver-master1            1/1     Running   9 (9m19s ago)    3d20h   78.47.138.201    master1              
kube-controller-manager-master1   1/1     Running   10 (8m52s ago)   3d20h   78.47.138.201    master1              
kube-proxy-8z8tb                  1/1     Running   4 (4m4s ago)     3d20h   78.47.138.201    master1              
kube-proxy-jtzh2                  1/1     Running   0                3d20h   195.201.16.123   node2                
kube-proxy-qq4js                  1/1     Running   0                3d20h   116.203.98.191   node1                
kube-scheduler-master1            1/1     Running   9 (9m5s ago)     3d20h   78.47.138.201    master1              
metrics-server-6597f96bb5-tgxdr   1/1     Running   0                3d20h   10.36.0.1        node2                
weave-net-8lb4h                   2/2     Running   92 (3m24s ago)   502d    78.47.138.201    master1              
weave-net-l8zx8                   2/2     Running   3 (108d ago)     268d    195.201.16.123   node2                
weave-net-qp8nr                   2/2     Running   38 (108d ago)    502d    116.203.98.191   node1                
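Instead of waiting a fixed two minutes after the reboot, a small retry loop can poll until the API server answers. This is a sketch under my own assumptions: wait_until is a hypothetical helper, and the timeout and interval values in the usage line are arbitrary.

```shell
# Hypothetical helper: retry a command until it succeeds or a timeout expires.
# Arguments: timeout in seconds, interval in seconds, then the command to run.
wait_until() {
  timeout="$1"; interval="$2"; shift 2
  waited=0
  until "$@"; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Usage: wait up to 5 minutes for the API server to answer.
#   wait_until 300 10 kubectl get nodes
```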

To be sure that everything is up and running, issue the following set of commands and check their output:

sudo systemctl status containerd
sudo systemctl status kubelet
kubectl get nodes -o wide
sudo docker ps # should be empty or create an error, if docker is disabled
sudo ctr -n k8s.io containers ls
kubectl get pod --all-namespaces -o wide

Step 2.11: You are running weave and it is still not working?

Do you use weave as your container network interface? Before I found that most of my problems were caused by an incompatible containerd version, I also tried to reinstall weave. I am not 100% sure yet whether this was needed (I will find out when I perform the migration on our other clusters, and I will update the blog post). If you want to give it a try, this is what I did:

sudo kubectl apply -f "https://cloud.weave.works/k8s/net.yaml?k8s-version=$(kubectl version | base64 | tr -d '\n')"

# restart of weave PODs
kubectl -n kube-system get pods

# and then delete all weave PODs, e.g. 
kubectl -n kube-system delete pod weave-net-vcvct
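Deleting the weave pods one by one can be scripted. The sketch below selects pods by the weave-net- name prefix seen in the listings of this post; restart_weave_pods is a hypothetical helper name. The DaemonSet recreates each deleted pod.

```shell
# Hypothetical helper: delete every pod in kube-system whose name starts with
# "weave-net-", so that the DaemonSet recreates it.
restart_weave_pods() {
  for pod in $(kubectl -n kube-system get pods -o name | grep '^pod/weave-net-'); do
    kubectl -n kube-system delete "$pod"
  done
}

# Usage:
#   restart_weave_pods
```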

Step 3 (optional): Upgrade containerd

Please make sure that the Kubernetes version supports the used containerd version. We have performed the upgrade based on Kubernetes version v1.22.9. This version does not support containerd v1.6.x. The latest supported version is 1.5.x, so we have upgraded containerd to v1.5.10.
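To avoid installing a 1.6.x release by accident, the version string can be checked before anything is downloaded. This is a minimal sketch that encodes only the constraint stated above for Kubernetes v1.22.9 (containerd 1.5.x); is_containerd_1_5 is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only for containerd 1.5.x version strings.
# An optional leading "v" is ignored, so both "1.5.10" and "v1.5.10" pass.
is_containerd_1_5() {
  case "${1#v}" in
    1.5.*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Usage, e.g. right after VERSION has been determined:
#   is_containerd_1_5 "$VERSION" || { echo "refusing to install containerd $VERSION"; exit 1; }
```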

The next step overwrites the containerd config file, but keeps a backup. Note, however, that all containerd binaries are overwritten without a backup.

# VERSION=1.5.10
# or get latest 1.5.x version via:
VERSION=$(curl -s -L https://github.com/containerd/containerd/releases/ \
  | grep -E 'download/v1\.5\.[0-9]*/containerd-1\.5\.[0-9]*-linux-amd64\.tar\.gz"' \
  | tail -1 | awk -F'"' '{print $2}' | awk -F'download/v' '{print $2}' | awk -F'/' '{print $1}')
echo "VERSION=$VERSION"
kubectl drain $(hostname) --ignore-daemonsets --delete-emptydir-data
curl -L https://github.com/containerd/containerd/releases/download/v${VERSION}/containerd-${VERSION}-linux-amd64.tar.gz | sudo tar -xvz -C /usr/
sudo cp /etc/containerd/config.toml /etc/containerd/config.toml.bak-$(date +"%Y-%m-%dT%H:%M:%S")
containerd config default | sudo tee /etc/containerd/config.toml
sudo systemctl enable containerd # on all machines
sudo systemctl restart containerd # on all machines
sleep 10
sudo systemctl status containerd # on all machines
kubectl uncordon $(hostname)

Now we should perform a reboot and check the system as follows:

sudo reboot

Let us wait for 2 minutes before we reconnect to the system and check it again:

sudo systemctl status containerd
sudo systemctl status kubelet
kubectl get nodes -o wide
sudo docker ps # should be empty or create an error, if docker is disabled: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
sudo ctr -n k8s.io containers ls
kubectl get pod --all-namespaces -o wide

Appendix: Errors

A.1 FailedCreatePodSandBox

This error was seen after upgrading containerd to v1.6:

kubectl -n jenkins describe pod ure-2fcaas-1176-ram-usage-is-shown-in-mb-instead-of-mi-16-rv2sx

output:

Events:
  Type     Reason                  Age               From               Message
  ----     ------                  ----              ----               -------
...
  Warning  FailedCreatePodSandBox  59s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "7de256b329f49218028bf97137dcc50a427eb9dc09131f7d2c27c165e66509bc": failed to find network info for sandbox "7de256b329f49218028bf97137dcc50a427eb9dc09131f7d2c27c165e66509bc

The error was also seen on the containerd service:

sudo systemctl status containerd -l
# output:
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2022-05-10 08:13:24 CEST; 2h 28min ago
     Docs: https://containerd.io
  Process: 25651 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 25653 (containerd)

...
May 10 10:40:44 dev-node2 containerd[25653]: time="2022-05-10T10:40:44.831875685+02:00" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:ure-2fcaas-1176-ram-usage-is-shown-in-mb-instead-of-mi-16-nstmg,Uid:191501c2-e064-425c-83b3-ebf59e36651e,Namespace:jenkins,Attempt:0,}"
May 10 10:40:44 dev-node2 containerd[25653]: time="2022-05-10T10:40:44.978221610+02:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:ure-2fcaas-1176-ram-usage-is-shown-in-mb-instead-of-mi-16-nstmg,Uid:191501c2-e064-425c-83b3-ebf59e36651e,Namespace:jenkins,Attempt:0,} failed, error" error="failed to setup network for sandbox \"ec39e4b097041b53c59bcd2396f270b1a6eb598bc20a8d7065189ab58e9942f6\": failed to find network info for sandbox \"ec39e4b097041b53c59bcd2396f270b1a6eb598bc20a8d7065189ab58e9942f6\""
May 10 10:40:59 dev-node2 containerd[25653]: time="2022-05-10T10:40:59.832134457+02:00" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:ure-2fcaas-1176-ram-usage-is-shown-in-mb-instead-of-mi-16-nstmg,Uid:191501c2-e064-425c-83b3-ebf59e36651e,Namespace:jenkins,Attempt:0,}"
...

Solution:

I had to downgrade containerd from v1.6.x to v1.5.x, since my Kubernetes version v1.22.x does not support containerd 1.6.

A.2 ctr: failed to dial…

$ ctr -n k8s.io containers ls
ctr: failed to dial "/run/containerd/containerd.sock": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: permission denied"

Solution:

You need to run ctr as root, so add a sudo:

sudo ctr -n k8s.io containers ls

A.3 ctr: context deadline exceeded

When running

sudo ctr -n k8s.io containers ls

I got:

ctr: failed to dial "/run/containerd/containerd.sock": context deadline exceeded

This happens if Docker is disabled while containerd is still configured to use Docker.

Solution: I have resolved this by creating a new default configuration for containerd:

containerd config default | sudo tee /etc/containerd/config.toml

A.4 kubectl error: You must be logged in to the server

I had tried to update weave, but this can happen for any kubectl command:

sudo kubectl apply -f "https://cloud.weave.works/k8s/v1.10/net.yaml?k8s-version=$(kubectl version | base64 | tr -d '\n')"
error: You must be logged in to the server (the server has asked for the client to provide credentials)

This was caused by an outdated kube-config file for the root user after upgrading Kubernetes. To update the file, run the following commands:

# for normal users:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# for root:
sudo mkdir -p /root/.kube
sudo cp -i /etc/kubernetes/admin.conf /root/.kube/config
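To find out whether a kube-config file is outdated before overwriting it, the file can be compared with admin.conf. The sketch below assumes that the user's config is meant to be a plain copy of /etc/kubernetes/admin.conf, as in the commands above; kubeconfig_is_current is a hypothetical helper name, and reading admin.conf usually requires root.

```shell
# Hypothetical helper: succeed if the user's kubeconfig is byte-identical to
# the admin.conf written by kubeadm. Assumes the kubeconfig is a plain copy.
kubeconfig_is_current() {
  user_config="${1:-$HOME/.kube/config}"
  admin_config="${2:-/etc/kubernetes/admin.conf}"
  cmp -s "$admin_config" "$user_config"
}

# Usage (as root):
#   kubeconfig_is_current /root/.kube/config || echo "kube-config is outdated"
```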

I hope that helps. If there are errors, please feel free to send a comment.
