Migrating a kubeadm-based Kubernetes cluster from Docker to containerd should be easy. In my case, it wasn't. Wrong information found on the Internet, combined with an incompatibility between the Kubernetes and containerd versions in use, caused major problems. Fortunately, I had created full snapshots of my virtual master and nodes, so I was able to start from scratch several times. This step-by-step guide describes how it finally worked for me.
References:
- This is the official documentation: Changing the Container Runtime on a Node from Docker Engine to containerd
- When you ask Google about "kubernetes migrate from docker to containerd", you also find Part 2: How to migrate to containerd and CRI-O after Dockershim Deprecation in Kubernetes 1.24, but it has a typo that caused quite a headache. Unfortunately, they do not offer a way to report the typo.
Overview
Recommended workflow:
- Start with the master(s), then perform this guide node by node
- Upgrade Kubernetes, if needed (we upgraded from v1.21.3 to v1.22.9)
- Create a full backup or snapshot before the Kubernetes upgrade and before the migration
- First, move from Docker to containerd, perform a reboot, and check the system (steps 2.x)
- Then optionally update to the latest supported containerd version (steps 3.x). Note: for Kubernetes 1.22.x, containerd 1.5.10 is supported, but 1.6.x does not work. I had tried it all in one step, and this caused longer troubleshooting sessions.
- See the appendix below for some error messages you might encounter, together with proposed solutions.
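Before starting, it can save a troubleshooting session to sanity-check the version pairing up front. This is a minimal sketch (not from the original guide) that uses `sort -V` to verify a planned containerd version stays below 1.6, which Kubernetes 1.22.x does not support per the experience above:

```shell
# Assumption: cluster runs Kubernetes v1.22.x, so containerd must stay < 1.6.
CONTAINERD_VERSION=1.5.10   # the version you plan to install

# sort -V orders version strings numerically; if 1.6.0 sorts first,
# CONTAINERD_VERSION is >= 1.6.0 and we should not proceed.
if [ "$(printf '%s\n' "$CONTAINERD_VERSION" 1.6.0 | sort -V | head -1)" = "1.6.0" ]; then
  echo "containerd $CONTAINERD_VERSION is too new for Kubernetes 1.22.x"
else
  echo "containerd $CONTAINERD_VERSION should be compatible"
fi
```

The same comparison works for any pair of dotted version strings, so you can reuse it when checking other component combinations.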
Step 1 (optional): Upgrade Kubernetes
The following steps were tested with Kubernetes v1.22.9. Kubernetes version 1.24 is already available, but I did not dare to upgrade the cluster and render my Docker-based systems unusable before migrating. If your system is older than v1.22.9, we recommend performing an upgrade.
See here an example of how to upgrade the cluster: https://vocon-it.com/2022/05/31/upgrade-kubernetes-cluster-cheat-sheet/
Step 2: Migrate from Docker to containerd
Better create another backup or snapshot here…
Step 2.1: Stop all Containers
kubectl drain $(hostname) --ignore-daemonsets --delete-emptydir-data
Step 2.2: Configure containerd
Reset the containerd configuration to default (make a backup of /etc/containerd/config.toml first, if needed).
containerd config default | sudo tee /etc/containerd/config.toml
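One detail worth checking after generating the default config: `containerd config default` emits `SystemdCgroup = false`, while kubeadm clusters commonly run the kubelet with the systemd cgroup driver. Whether you need to flip this depends on your kubelet configuration; the guide above did not require it. The following sketch demonstrates the edit on a throwaway sample file rather than the real /etc/containerd/config.toml:

```shell
# Demo on a local sample file; on a real node you would run the sed
# against /etc/containerd/config.toml (as root) instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = false
EOF

sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' "$sample"
grep SystemdCgroup "$sample"   # now shows: SystemdCgroup = true
```

If you change this on a real node, restart containerd and the kubelet afterwards.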
Step 2.3: Prepare the System
Prerequisites as defined in the official documentation:
cat <<EOF | sudo tee /etc/modules-load.d/containerd.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

# Setup required sysctl params, these persist across reboots.
cat <<EOF | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-iptables  = 1
net.ipv4.ip_forward                 = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF

# Apply sysctl params without reboot
sudo sysctl --system
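It is cheap to verify that the sysctl fragment was written as intended before moving on. This is a side-effect-free sketch that writes the same fragment to a temp file and checks all three keys; on the real node you would run the grep against /etc/sysctl.d/99-kubernetes-cri.conf:

```shell
# Write the sysctl fragment to a temp file and verify each key is set to 1.
conf=$(mktemp)
cat > "$conf" <<'EOF'
net.bridge.bridge-nf-call-iptables  = 1
net.ipv4.ip_forward                 = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF

for key in net.bridge.bridge-nf-call-iptables net.ipv4.ip_forward net.bridge.bridge-nf-call-ip6tables; do
  grep -q "^$key *= *1" "$conf" && echo "$key OK"
done
```

On a live node, `sysctl net.ipv4.ip_forward` shows the value the kernel actually applied.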
Step 2.4: Restart containerd
Now all preparations are done to restart containerd:
sudo systemctl enable containerd
sudo systemctl restart containerd
sleep 10
sudo systemctl status containerd
Step 2.5: Reconfigure kubelet
Now let us reconfigure Kubernetes to use containerd as its runtime:
sudo cat /var/lib/kubelet/kubeadm-flags.env | grep containerd || sudo sed -i 's|KUBELET_KUBEADM_ARGS="|KUBELET_KUBEADM_ARGS="--container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock |' /var/lib/kubelet/kubeadm-flags.env
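Note that this one-liner is idempotent: if the flags are already present, the grep succeeds and the sed is skipped, so running it twice does not insert the flags twice. A minimal local demo on a throwaway copy of kubeadm-flags.env (the sample flag values are just placeholders):

```shell
# Demo of the idempotent "grep || sed" pattern on a temp file.
f=$(mktemp)
echo 'KUBELET_KUBEADM_ARGS="--network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.5"' > "$f"

add_containerd_flags() {
  grep -q containerd "$f" || sed -i 's|KUBELET_KUBEADM_ARGS="|KUBELET_KUBEADM_ARGS="--container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock |' "$f"
}

add_containerd_flags   # first run: inserts the flags
add_containerd_flags   # second run: grep matches, file stays unchanged
grep -c container-runtime-endpoint "$f"   # prints 1, not 2
```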
Here, we need to perform a manual step:
kubectl edit node $(hostname)
In the editor that opens, we need to change the value of kubeadm.alpha.kubernetes.io/cri-socket from /var/run/dockershim.sock to the CRI socket path unix:///run/containerd/containerd.sock.
Now we can restart the kubelet:
sudo systemctl restart kubelet
sleep 10
sudo systemctl status kubelet
Step 2.6: Check Kubernetes Status
Now we should see the new container runtime when running the following command:
kubectl get nodes -o wide
We should see an output similar to the following:
NAME          STATUS   ROLES                  AGE    VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
dev-master1   Ready    control-plane,master   369d   v1.22.9   116.203.100.241   <none>        CentOS Linux 7 (Core)   3.10.0-1160.25.1.el7.x86_64   containerd://1.2.13
dev-node1     Ready    <none>                 369d   v1.22.9   116.203.107.195   <none>        CentOS Linux 7 (Core)   3.10.0-1160.25.1.el7.x86_64   containerd://1.5.10
dev-node2     Ready    <none>                 295d   v1.22.9   116.203.74.214    <none>        CentOS Linux 7 (Core)   3.10.0-1160.31.1.el7.x86_64   containerd://1.5.10
...
This might show a mixed state with some nodes running docker and others running containerd.
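To spot the nodes that are still on Docker at a glance, you can filter the CONTAINER-RUNTIME column. A sketch against a captured sample of the output (the docker://20.10.9 version shown is a hypothetical example, not from the cluster above):

```shell
# Print the names of nodes whose runtime column still starts with "docker",
# from a saved "kubectl get nodes -o wide" output. On a live cluster:
#   kubectl get nodes -o wide | awk 'NR>1 && $NF ~ /^docker/ {print $1}'
cat <<'EOF' | awk 'NR>1 && $NF ~ /^docker/ {print $1}'
NAME          STATUS   ROLES                  AGE    VERSION   CONTAINER-RUNTIME
dev-master1   Ready    control-plane,master   369d   v1.22.9   containerd://1.5.10
dev-node1     Ready    <none>                 369d   v1.22.9   docker://20.10.9
EOF
```

This prints `dev-node1`, telling you which node to migrate next.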
Step 2.7: Enable Scheduling of Containers
Now that the node is migrated, we can schedule containers on it again:
kubectl uncordon $(hostname)
Step 2.8: Check the System
Now the containers are started again on the node. We can check this with the following command:
kubectl get pod --all-namespaces -o wide
Here you might see some important PODs in CrashLoopBackOff status. In my case, on my PROD cluster, etcd-master1, kube-apiserver-master1, and kube-controller-manager-master1 were in this bad state. The details in the Events section looked as follows:
kubectl -n kube-system describe pod etcd-master1 | grep -A 100 Events

The output was:
Events:
  Type     Reason          Age                     From     Message
  ----     ------          ----                    ----     -------
  Warning  Unhealthy       10m (x2 over 10m)       kubelet  Liveness probe failed: Get "http://127.0.0.1:2381/health": dial tcp 127.0.0.1:2381: connect: connection refused
  Normal   SandboxChanged  10m                     kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          10m                     kubelet  Container image "k8s.gcr.io/etcd:3.5.0-0" already present on machine
  Normal   Created         10m                     kubelet  Created container etcd
  Normal   Started         10m                     kubelet  Started container etcd
  Normal   Pulling         6m17s                   kubelet  Pulling image "k8s.gcr.io/etcd:3.5.0-0"
  Normal   Pulled          5m58s                   kubelet  Successfully pulled image "k8s.gcr.io/etcd:3.5.0-0" in 18.49950886s
  Normal   Created         5m23s (x4 over 5m58s)   kubelet  Created container etcd
  Normal   Started         5m23s (x4 over 5m58s)   kubelet  Started container etcd
  Normal   Pulled          5m23s (x3 over 5m58s)   kubelet  Container image "k8s.gcr.io/etcd:3.5.0-0" already present on machine
  Warning  BackOff         65s (x36 over 5m57s)    kubelet  Back-off restarting failed container

However, this was fixed by a reboot (I hope this is the case for you as well).
Step 2.9 (optional): Disable Docker
We still need Docker for the CI's docker build processes on my DEV cluster. This is why I have not disabled Docker there. However, Docker is no longer needed in production, so I have disabled it on my PROD cluster:
sudo systemctl disable docker --now
sudo systemctl stop docker
Check that Docker is stopped. If it is not, the reboot in the next step might help. If that still does not help, you can also remove Docker from the system:
sudo yum remove -y docker
Step 2.10: Check again after Reboot
In my case, there were still Docker containers running, since I had not disabled Docker: our CI still needs to run docker build commands. In any case, after such a big change, we should perform a reboot:
sudo reboot
Wait 2 minutes, reconnect to the system and check the PODs on the kube-system namespace again:
kubectl -n kube-system get pod -o wide
This time, the output looked like follows in my PROD system:
$ kubectl -n kube-system get pod -o wide
NAME                              READY   STATUS    RESTARTS         AGE     IP               NODE      NOMINATED NODE   READINESS GATES
coredns-78fcd69978-8qj4j          1/1     Running   0                3d20h   10.36.0.7        node2     <none>           <none>
coredns-78fcd69978-xs2tl          1/1     Running   3 (4m4s ago)     3d20h   10.44.0.2        master1   <none>           <none>
etcd-master1                      1/1     Running   9 (9m18s ago)    3d20h   78.47.138.201    master1   <none>           <none>
kube-apiserver-master1            1/1     Running   9 (9m19s ago)    3d20h   78.47.138.201    master1   <none>           <none>
kube-controller-manager-master1   1/1     Running   10 (8m52s ago)   3d20h   78.47.138.201    master1   <none>           <none>
kube-proxy-8z8tb                  1/1     Running   4 (4m4s ago)     3d20h   78.47.138.201    master1   <none>           <none>
kube-proxy-jtzh2                  1/1     Running   0                3d20h   195.201.16.123   node2     <none>           <none>
kube-proxy-qq4js                  1/1     Running   0                3d20h   116.203.98.191   node1     <none>           <none>
kube-scheduler-master1            1/1     Running   9 (9m5s ago)     3d20h   78.47.138.201    master1   <none>           <none>
metrics-server-6597f96bb5-tgxdr   1/1     Running   0                3d20h   10.36.0.1        node2     <none>           <none>
weave-net-8lb4h                   2/2     Running   92 (3m24s ago)   502d    78.47.138.201    master1   <none>           <none>
weave-net-l8zx8                   2/2     Running   3 (108d ago)     268d    195.201.16.123   node2     <none>           <none>
weave-net-qp8nr                   2/2     Running   38 (108d ago)    502d    116.203.98.191   node1     <none>           <none>
To be sure that everything is up and running, you should issue the following set of commands and check the output of each:
sudo systemctl status containerd
sudo systemctl status kubelet
kubectl get nodes -o wide
sudo docker ps   # should be empty or raise an error, if Docker is disabled
sudo ctr -n k8s.io containers ls
kubectl get pod --all-namespaces -o wide
Step 2.11: You are running weave and it is still not working?
Do you have weave as your container network interface? Before I found that most of my problems were caused by an incompatible containerd version, I also tried to reinstall weave. I am not 100% sure yet whether this was needed (I will find out when I perform the migration on our other clusters, and I will update the blog post). If you want to give it a try, this is what I have done:
sudo kubectl apply -f "https://cloud.weave.works/k8s/net.yaml?k8s-version=$(kubectl version | base64 | tr -d '\n')"

# restart of weave PODs
kubectl -n kube-system get pods
# and then delete all weave PODs, e.g.
kubectl -n kube-system delete pod weave-net-vcvct
Step 3 (optional): Upgrade containerd
Please make sure that your Kubernetes version supports the containerd version you want to install. We performed the upgrade on Kubernetes v1.22.9. This version does not support containerd v1.6.x; the latest supported release line is 1.5.x, so we upgraded containerd to v1.5.10.
The next step will overwrite the containerd config file, but it will keep a backup. Note, however, that all containerd binaries are overwritten without a backup being created.
# VERSION=1.5.10
# or get latest 1.5.x version via:
VERSION=$(curl -s -L https://github.com/containerd/containerd/releases/ \
  | egrep 'download\/v1\.5\.[0-9]*\/containerd-1\.5\.[0-9]*-linux-amd64.tar.gz"' \
  | tail -1 | awk -F'"' '{print $2}' | awk -F'download/v' '{print $2}' | awk -F'/' '{print $1}')
echo VERSION=$VERSION

kubectl drain $(hostname) --ignore-daemonsets --delete-emptydir-data

curl -L https://github.com/containerd/containerd/releases/download/v${VERSION}/containerd-${VERSION}-linux-amd64.tar.gz \
  | sudo tar -xvz -C /usr/

sudo cp /etc/containerd/config.toml /etc/containerd/config.toml.bak-$(date +"%Y-%m-%dT%H:%M:%S")
containerd config default | sudo tee /etc/containerd/config.toml

sudo systemctl enable containerd    # on all machines
sudo systemctl restart containerd   # on all machines
sleep 10
sudo systemctl status containerd    # on all machines

kubectl uncordon $(hostname)
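The VERSION pipeline above scrapes the GitHub releases page and pulls the version number out of the download link. If you want to convince yourself it extracts the right string before pointing it at the network, you can feed it a sample release link (the href below is a stand-in for what the page serves):

```shell
# Run a sample release link through the same extraction chain used above.
SAMPLE='<a href="https://github.com/containerd/containerd/releases/download/v1.5.10/containerd-1.5.10-linux-amd64.tar.gz" rel="nofollow">'
VERSION=$(echo "$SAMPLE" \
  | egrep 'download\/v1\.5\.[0-9]*\/containerd-1\.5\.[0-9]*-linux-amd64.tar.gz"' \
  | awk -F'"' '{print $2}' | awk -F'download/v' '{print $2}' | awk -F'/' '{print $1}')
echo "VERSION=$VERSION"   # VERSION=1.5.10
```

Note that the pipeline depends on GitHub's page markup; if the extraction ever returns an empty VERSION, inspect the page HTML and adjust the pattern.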
Now we should perform a reboot and check the system as follows:
sudo reboot
Let us wait for 2 minutes before we reconnect to the system and check it again:
sudo systemctl status containerd
sudo systemctl status kubelet
kubectl get nodes -o wide
sudo docker ps   # should be empty or raise an error, if Docker is disabled: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
sudo ctr -n k8s.io containers ls
kubectl get pod --all-namespaces -o wide
Appendix: Errors
A.1 FailedCreatePodSandBox
This error was seen after upgrading containerd to v1.6:
kubectl -n jenkins describe pod ure-2fcaas-1176-ram-usage-is-shown-in-mb-instead-of-mi-16-rv2sx
output:
Events:
  Type     Reason                  Age   From     Message
  ----     ------                  ----  ----     -------
  ...
  Warning  FailedCreatePodSandBox  59s   kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "7de256b329f49218028bf97137dcc50a427eb9dc09131f7d2c27c165e66509bc": failed to find network info for sandbox "7de256b329f49218028bf97137dcc50a427eb9dc09131f7d2c27c165e66509bc"
The error was also seen on the containerd service:
sudo systemctl status containerd -l
# output:
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2022-05-10 08:13:24 CEST; 2h 28min ago
     Docs: https://containerd.io
  Process: 25651 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 25653 (containerd)
...
May 10 10:40:44 dev-node2 containerd[25653]: time="2022-05-10T10:40:44.831875685+02:00" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:ure-2fcaas-1176-ram-usage-is-shown-in-mb-instead-of-mi-16-nstmg,Uid:191501c2-e064-425c-83b3-ebf59e36651e,Namespace:jenkins,Attempt:0,}"
May 10 10:40:44 dev-node2 containerd[25653]: time="2022-05-10T10:40:44.978221610+02:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:ure-2fcaas-1176-ram-usage-is-shown-in-mb-instead-of-mi-16-nstmg,Uid:191501c2-e064-425c-83b3-ebf59e36651e,Namespace:jenkins,Attempt:0,} failed, error" error="failed to setup network for sandbox \"ec39e4b097041b53c59bcd2396f270b1a6eb598bc20a8d7065189ab58e9942f6\": failed to find network info for sandbox \"ec39e4b097041b53c59bcd2396f270b1a6eb598bc20a8d7065189ab58e9942f6\""
May 10 10:40:59 dev-node2 containerd[25653]: time="2022-05-10T10:40:59.832134457+02:00" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:ure-2fcaas-1176-ram-usage-is-shown-in-mb-instead-of-mi-16-nstmg,Uid:191501c2-e064-425c-83b3-ebf59e36651e,Namespace:jenkins,Attempt:0,}"
...
Solution:
I had to downgrade containerd from v1.6.x to v1.5.x, since my Kubernetes version v1.22.x does not support containerd 1.6.
A.2 ctr: failed to dial…
$ ctr -n k8s.io containers ls
ctr: failed to dial "/run/containerd/containerd.sock": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: permission denied"
Solution:
You need to run ctr as root, so add a sudo:
sudo ctr -n k8s.io containers ls
A.3 ctr: context deadline exceeded
When running
sudo ctr -n k8s.io containers ls
I got:
ctr: failed to dial "/run/containerd/containerd.sock": context deadline exceeded
This is caused if Docker is disabled and containerd is still configured to use Docker.
Solution: I have resolved this by creating a new default configuration for containerd:
containerd config default | sudo tee /etc/containerd/config.toml
A.4 kubectl error: You must be logged in to the server
I encountered this when trying to update weave, but it can happen with any kubectl command:
sudo kubectl apply -f "https://cloud.weave.works/k8s/v1.10/net.yaml?k8s-version=$(kubectl version | base64 | tr -d '\n')"
error: You must be logged in to the server (the server has asked for the client to provide credentials)
This was caused by an outdated kube-config file for the root user after upgrading Kubernetes. To update the file, perform the following commands:
# for normal users:
sudo mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# for root:
sudo mkdir -p /root/.kube
sudo cp -i /etc/kubernetes/admin.conf /root/.kube/config
I hope that helps. If there are errors, please feel free to send a comment.