
Cluster information:

Kubernetes version: v1.28.2
Cloud being used: VirtualBox
Installation method: Kubernetes Cluster VirtualBox
Host OS: Ubuntu 22.04.3 LTS
CNI and version: calico
CRI and version: containerd://1.7.2

The cluster consists of 1 master node and 2 worker nodes. For the first couple of minutes after startup, everything looks good:

lab@master:~$ kubectl -nkube-system get po -o wide
NAME                                       READY   STATUS             RESTARTS          AGE     IP              NODE       NOMINATED NODE   READINESS GATES
calico-kube-controllers-7ddc4f45bc-4qx7l   1/1     Running            12 (2m11s ago)    13d     10.10.219.98    master     <none>           <none>
calico-node-bqlnm                          1/1     Running            3 (2m11s ago)     4d2h    192.168.1.164   master     <none>           <none>
calico-node-mrd86                          1/1     Running            105 (2d20h ago)   4d2h    192.168.1.165   worker01   <none>           <none>
calico-node-r6w9s                          1/1     Running            110 (2d20h ago)   4d2h    192.168.1.166   worker02   <none>           <none>
coredns-5dd5756b68-njtpf                   1/1     Running            11 (2m11s ago)    13d     10.10.219.100   master     <none>           <none>
coredns-5dd5756b68-pxn8l                   1/1     Running            10 (2m11s ago)    13d     10.10.219.99    master     <none>           <none>
etcd-master                                1/1     Running            67 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
kube-apiserver-master                      1/1     Running            43 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
kube-controller-manager-master             1/1     Running            47 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
kube-proxy-ffnzb                           1/1     Running            122 (95s ago)     12d     192.168.1.165   worker01   <none>           <none>
kube-proxy-hf4mx                           1/1     Running            108 (78s ago)     12d     192.168.1.166   worker02   <none>           <none>
kube-proxy-ql576                           1/1     Running            15 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
kube-scheduler-master                      1/1     Running            46 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
metrics-server-54cb77cffd-q292x            0/1     CrashLoopBackOff   68 (18s ago)      3d21h   10.10.30.94     worker02   <none>           <none>

However, a few minutes later, pods in the kube-system namespace start flapping/crashing.

lab@master:~$ kubectl -nkube-system get po
NAME                                       READY   STATUS             RESTARTS          AGE
calico-kube-controllers-7ddc4f45bc-4qx7l   1/1     Running            12 (19m ago)      13d
calico-node-bqlnm                          0/1     Running            3 (19m ago)       4d2h
calico-node-mrd86                          0/1     CrashLoopBackOff   111 (2m28s ago)   4d2h
calico-node-r6w9s                          0/1     CrashLoopBackOff   116 (2m15s ago)   4d2h
coredns-5dd5756b68-njtpf                   1/1     Running            11 (19m ago)      13d
coredns-5dd5756b68-pxn8l                   1/1     Running            10 (19m ago)      13d
etcd-master                                1/1     Running            67 (19m ago)      13d
kube-apiserver-master                      1/1     Running            43 (19m ago)      13d
kube-controller-manager-master             1/1     Running            47 (19m ago)      13d
kube-proxy-ffnzb                           0/1     CrashLoopBackOff   127 (42s ago)     12d
kube-proxy-hf4mx                           0/1     CrashLoopBackOff   113 (2m17s ago)   12d
kube-proxy-ql576                           1/1     Running            15 (19m ago)      13d
kube-scheduler-master                      1/1     Running            46 (19m ago)      13d
metrics-server-54cb77cffd-q292x            0/1     CrashLoopBackOff   73 (64s ago)      3d22h

It is completely unclear to me what is wrong. Checking the pod description, I see repeating events:

lab@master:~$ kubectl -nkube-system describe po kube-proxy-ffnzb
.
.
.
Events:
  Type     Reason          Age                      From     Message
  ----     ------          ----                     ----     -------
  Normal   Killing         2d20h (x50 over 3d1h)    kubelet  Stopping container kube-proxy
  Warning  BackOff         2d20h (x1146 over 3d1h)  kubelet  Back-off restarting failed container kube-proxy in pod kube-proxy-ffnzb_kube-system(79f808ba-f450-4103-80a9-0e75af2e77cf)
  Normal   Pulled          8m11s (x3 over 10m)      kubelet  Container image "registry.k8s.io/kube-proxy:v1.28.6" already present on machine
  Normal   Created         8m10s (x3 over 10m)      kubelet  Created container kube-proxy
  Normal   Started         8m10s (x3 over 10m)      kubelet  Started container kube-proxy
  Normal   SandboxChanged  6m56s (x4 over 10m)      kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Killing         4m41s (x4 over 10m)      kubelet  Stopping container kube-proxy
  Warning  BackOff         12s (x28 over 10m)       kubelet  Back-off restarting failed container kube-proxy in pod kube-proxy-ffnzb_kube-system(79f808ba-f450-4103-80a9-0e75af2e77cf)

Note: this situation does not prevent me from deploying example workloads (nginx) - those seem to run stably. However, when I tried to add metrics-server, it keeps crashing (possibly related to the CrashLoopBackOff pods in the kube-system namespace).
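
Beyond kubectl describe, the next places I know to check are the kubelet and containerd journals on the failing workers, plus the last logs of the dead kube-proxy container (a sketch - unit names assume the standard systemd services, and crictl must be pointed at the containerd socket):

# on worker01/worker02
sudo journalctl -u kubelet --since "10 min ago" --no-pager | tail -n 50
sudo journalctl -u containerd --since "10 min ago" --no-pager | tail -n 50
sudo crictl ps -a | grep kube-proxy
sudo crictl logs <container-id-from-the-previous-output>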

Any ideas what might be wrong, or where else to look to troubleshoot?


1 Answer


I was given a hint by someone to check the SystemdCgroup setting in the containerd config file, following this link.

In my case it turned out that /etc/containerd/config.toml was missing on the master node.

  • To generate it:
    sudo containerd config default | sudo tee /etc/containerd/config.toml

  • Next, set SystemdCgroup = true in /etc/containerd/config.toml (see the snippet after this list)
  • Restart the containerd service:
    sudo systemctl restart containerd

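In containerd 1.7 this setting lives under [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]. As a quick sketch (assuming the file matches the generated default), flipping and verifying it looks like this:

sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
grep SystemdCgroup /etc/containerd/config.toml    # should now show: SystemdCgroup = true
sudo systemctl restart containerd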

That, however, put my cluster into the following state:

lab@master:~$ kubectl -nkube-system get po
The connection to the server master:6443 was refused - did you specify the right host or port?
lab@master:~$ kubectl get nodes
The connection to the server master:6443 was refused - did you specify the right host or port?

I reverted it back to false on the master node and restarted containerd. On the worker nodes, however, I kept it set to true.
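
When changing this, it is worth checking that the kubelet cgroup driver and containerd's SystemdCgroup setting agree on each node, since a mismatch is a common cause of this kind of instability. A quick comparison (the kubelet config path assumes a kubeadm-style install):

# effective containerd setting
sudo containerd config dump | grep SystemdCgroup
# kubelet side; "systemd" pairs with SystemdCgroup = true, "cgroupfs" with false
sudo grep cgroupDriver /var/lib/kubelet/config.yaml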

That fixed the problem:

lab@master:~$ kubectl -nkube-system get po -o wide
NAME                                       READY   STATUS    RESTARTS       AGE    IP              NODE       NOMINATED NODE   READINESS GATES
calico-kube-controllers-7ddc4f45bc-4qx7l   1/1     Running   8 (18m ago)    14d    10.10.219.86    master     <none>           <none>
calico-node-c4rxp                          1/1     Running   7 (14m ago)    89m    192.168.1.166   worker02   <none>           <none>
calico-node-dhzr8                          1/1     Running   7 (18m ago)    14d    192.168.1.164   master     <none>           <none>
calico-node-wqv8w                          1/1     Running   1 (14m ago)    27m    192.168.1.165   worker01   <none>           <none>
coredns-5dd5756b68-njtpf                   1/1     Running   7 (18m ago)    14d    10.10.219.88    master     <none>           <none>
coredns-5dd5756b68-pxn8l                   1/1     Running   6 (18m ago)    14d    10.10.219.87    master     <none>           <none>
etcd-master                                1/1     Running   62 (18m ago)   14d    192.168.1.164   master     <none>           <none>
kube-apiserver-master                      1/1     Running   38 (18m ago)   14d    192.168.1.164   master     <none>           <none>
kube-controller-manager-master             1/1     Running   42 (18m ago)   14d    192.168.1.164   master     <none>           <none>
kube-proxy-mgsdr                           1/1     Running   7 (14m ago)    89m    192.168.1.166   worker02   <none>           <none>
kube-proxy-ql576                           1/1     Running   10 (18m ago)   14d    192.168.1.164   master     <none>           <none>
kube-proxy-zl68t                           1/1     Running   8 (14m ago)    106m   192.168.1.165   worker01   <none>           <none>
kube-scheduler-master                      1/1     Running   41 (18m ago)   14d    192.168.1.164   master     <none>           <none>
metrics-server-98bc7f888-xtdxd             1/1     Running   7 (14m ago)    99m    10.10.5.8       worker01   <none>           <none>

Side note: I also disabled AppArmor (on the master and the workers):

sudo systemctl stop apparmor && sudo systemctl disable apparmor
