
This article records some of the problems I ran into while learning about and deploying K8s containers. Since I am not a professional DevOps engineer, there may be mistakes; while following along, please check whether each fix actually solves your problem, rather than letting my unprofessional wording mislead you and waste your time.
Kubeadm init failed
When deploying a cluster with kubeadm, the following error may occur:
- [root@env ~]# kubeadm init
- [init] Using Kubernetes version: v1.24.1
- [preflight] Running pre-flight checks
- error execution phase preflight: [preflight] Some fatal errors occurred:
- [ERROR CRI]: container runtime is not running: output: time="2023-05-27T12:19:30Z" level=fatal msg="validate service connection: CRI v1 runtime API is not implemented for endpoint \"unix:///var/run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"
- , error: exit status 1
- [preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
- To see the stack trace of this error execute with --v=5 or higher
Problem Cause
In a nutshell: The containerd version is too old, or the cri plugin has been disabled in containerd.
Details:
There are two possible causes for this problem. One is an outdated containerd, which can be fixed by upgrading containerd.
If you have confirmed that the installed version is the latest, or very close to it, then the cause is likely that the default configuration file disables the cri plugin:
- root@Debian-11-00-x64:~# cat /etc/containerd/config.toml
- ......
- disabled_plugins = ["cri"]
- ......
Solution
In a nutshell: Re-enable the disabled cri plugin.
Details:
Edit disabled_plugins in the configuration file /etc/containerd/config.toml, remove "cri" from the list, and restart containerd.
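A minimal sketch of that change, assuming the stock config.toml shipped by the containerd package (if the sed pattern does not match your file, edit it by hand, or regenerate a default config with containerd config default):
- # remove "cri" from disabled_plugins, then restart containerd
- sed -i 's/disabled_plugins = \["cri"\]/disabled_plugins = []/' /etc/containerd/config.toml
- systemctl restart containerd
- # afterwards, kubeadm init should pass the CRI preflight check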
Coredns problem
When deploying a cluster with kubeadm, kubeadm creates the basic services K8s needs in the kube-system namespace. In my 1.21.2 cluster, however, this pod could not be created successfully: the STATUS field of pod/coredns stayed at ImagePullBackOff or ErrImagePull.
This problem causes every node in the k8s cluster to fail to resolve names when accessing services by “domain name”, which eventually leads to access failures. If your workloads use headless services or in-cluster domain names, check whether this service is healthy.
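A quick way to check in-cluster DNS is to resolve a well-known service name from a throwaway pod. The snippet below is only a sketch and assumes the busybox image is reachable from your nodes:
- # run a one-off pod and try to resolve the apiserver service name
- kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- nslookup kubernetes.default
- # if coredns is broken, the lookup times out or reports "can't resolve"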
The screenshot of the error from my own cluster is gone, so here is a similar picture from someone else 1:

Problem Cause
In a nutshell: There is no coredns image in the mirror source.
Details: For well-known reasons, I believe most people will choose a domestic mirror when deploying a k8s cluster, and many will use the following source:
- registry.aliyuncs.com
However, there is no coredns image in this source.
- [root@env ~]# kubectl -n kube-system describe pod/coredns-6f6b8cc4f6-9c67s
- Name: coredns-6f6b8cc4f6-9c67s
- Namespace: kube-system
- …………
- Containers:
- coredns:
- Container ID: docker://e655081d120efe6a69e564e20edfa88c0e0ad2a29cb11f35870fe2cef1057fb0
- Image: registry.aliyuncs.com/google_containers/coredns:v1.8.0
- Image ID: docker-pullable://coredns/coredns@sha256:cc8fb77bc2a0541949d1d9320a641b82fd392b0d3d8145469ca4709ae769980e
- …………
- [root@env ~]# docker pull registry.aliyuncs.com/google_containers/coredns:v1.8.0
- Error response from daemon: manifest for registry.aliyuncs.com/google_containers/coredns:v1.8.0 not found: manifest unknown: manifest unknown
Solution
In a nutshell: Download the coredns image from the official source, then tag it locally as the required image name.
Details:
First get the coredns image from the official source:
- docker pull coredns/coredns:1.8.0
Then re-tag it locally as the image name the pod actually asks for (the name shown in the describe output above):
- docker tag coredns/coredns:1.8.0 registry.aliyuncs.com/google_containers/coredns:v1.8.0
The above operations theoretically need to be performed on all k8s nodes, because the coredns container may be allocated to any node.
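If you want to script the pull/tag step across nodes, something like the following works; the node names and passwordless ssh are assumptions, adjust them to your environment:
- for node in node1 node2 node3; do
-   ssh "$node" "docker pull coredns/coredns:1.8.0 && \
-     docker tag coredns/coredns:1.8.0 registry.aliyuncs.com/google_containers/coredns:v1.8.0"
- done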
NFS persistent volume unbound error
NFS is definitely one of the most entry-level and lowest-cost storage options for k8s. However, when I deployed containers using NFS volumes, I often found the pod stuck in the Pending state; a describe query showed that its PersistentVolumeClaim was unbound:

Problem Cause
In a nutshell: Nothing is responding to the PVC request, so the claim never actually gets bound to real storage on disk.
Details:
In k8s, every request for a persistent volume takes the form of a PVC. But once the request has been made, who fulfills it? There are two ways: either PVC requests are handled manually (by creating matching PVs by hand), or a provisioner container handles them automatically.
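For reference, "manual handling" means writing a PersistentVolume by hand that matches the claim. A minimal NFS PV sketch is shown below; the name and size are made up, and the server address and export path are borrowed from the provisioner deployment later in this section:
- apiVersion: v1
- kind: PersistentVolume
- metadata:
-   name: manual-nfs-pv          # hypothetical name
- spec:
-   capacity:
-     storage: 2Gi               # assumed size, should cover the PVC request
-   accessModes:
-     - ReadWriteMany
-   persistentVolumeReclaimPolicy: Retain
-   nfs:
-     server: 192.168.50.42
-     path: /volume1/nfs-kafka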
Solution
Most tutorials found online use the following container to handle NFS PVC requests:
- quay.io/external_storage/nfs-client-provisioner:latest
There is a good article describing the entire deployment process 2; here are the key points:
Since the nfs-client-provisioner container needs to access some state and attributes of the k8s cluster, it must be granted permissions first. The rbac.yaml file is as follows:
- apiVersion: v1
- kind: ServiceAccount
- metadata:
-   name: nfs-client-provisioner
-   # replace with namespace where provisioner is deployed
-   namespace: kafka-test
- ---
- kind: ClusterRole
- apiVersion: rbac.authorization.k8s.io/v1
- metadata:
-   name: nfs-client-provisioner-runner
- rules:
-   - apiGroups: [""]
-     resources: ["persistentvolumes"]
-     verbs: ["get", "list", "watch", "create", "delete"]
-   - apiGroups: [""]
-     resources: ["persistentvolumeclaims"]
-     verbs: ["get", "list", "watch", "update"]
-   - apiGroups: ["storage.k8s.io"]
-     resources: ["storageclasses"]
-     verbs: ["get", "list", "watch"]
-   - apiGroups: [""]
-     resources: ["events"]
-     verbs: ["create", "update", "patch"]
- ---
- kind: ClusterRoleBinding
- apiVersion: rbac.authorization.k8s.io/v1
- metadata:
-   name: run-nfs-client-provisioner
- subjects:
-   - kind: ServiceAccount
-     name: nfs-client-provisioner
-     # replace with namespace where provisioner is deployed
-     namespace: kafka-test
- roleRef:
-   kind: ClusterRole
-   name: nfs-client-provisioner-runner
-   apiGroup: rbac.authorization.k8s.io
- ---
- kind: Role
- apiVersion: rbac.authorization.k8s.io/v1
- metadata:
-   name: leader-locking-nfs-client-provisioner
-   # replace with namespace where provisioner is deployed
-   namespace: kafka-test
- rules:
-   - apiGroups: [""]
-     resources: ["endpoints"]
-     verbs: ["get", "list", "watch", "create", "update", "patch"]
- ---
- kind: RoleBinding
- apiVersion: rbac.authorization.k8s.io/v1
- metadata:
-   name: leader-locking-nfs-client-provisioner
-   # replace with namespace where provisioner is deployed
-   namespace: kafka-test
- subjects:
-   - kind: ServiceAccount
-     name: nfs-client-provisioner
-     # replace with namespace where provisioner is deployed
-     namespace: kafka-test
- roleRef:
-   kind: Role
-   name: leader-locking-nfs-client-provisioner
-   apiGroup: rbac.authorization.k8s.io
Then use the following command to create authorization:
- kubectl apply -f rbac.yaml
Then create the nfs-client-provisioner pod; first the deployment.yaml file:
- apiVersion: apps/v1
- kind: Deployment
- metadata:
-   name: nfs-client-provisioner
-   labels:
-     app: nfs-client-provisioner
-   # replace with namespace where provisioner is deployed
-   namespace: kafka-test
- spec:
-   replicas: 1
-   strategy:
-     type: Recreate
-   selector:
-     matchLabels:
-       app: nfs-client-provisioner
-   template:
-     metadata:
-       labels:
-         app: nfs-client-provisioner
-     spec:
-       serviceAccountName: nfs-client-provisioner
-       containers:
-         - name: nfs-client-provisioner
-           image: quay.io/external_storage/nfs-client-provisioner:latest
-           volumeMounts:
-             - name: nfs-client-root
-               mountPath: /persistentvolumes
-           env:
-             - name: PROVISIONER_NAME
-               value: fuseim.pri/ifs
-             - name: NFS_SERVER
-               value: 192.168.50.42
-             - name: NFS_PATH
-               value: /volume1/nfs-kafka
-       volumes:
-         - name: nfs-client-root
-           nfs:
-             server: 192.168.50.42
-             path: /volume1/nfs-kafka
Then use the following command to create:
- kubectl apply -f deployment.yaml
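Before moving on, it is worth checking that the provisioner pod is actually Running; the namespace and label below are taken from the deployment above:
- kubectl -n kafka-test get pod -l app=nfs-client-provisioner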
Finally, create the corresponding NFS StorageClass, the storage-class.yaml file:
- apiVersion: storage.k8s.io/v1
- kind: StorageClass
- metadata:
-   name: managed-nfs-storage # Note that this value should be consistent with the storageClassName of volumeClaimTemplates in the business container
- provisioner: fuseim.pri/ifs # Note that this value should be consistent with the 'env PROVISIONER_NAME' in the deployment file (deployment.yaml) of the nfs-client-provisioner container
- parameters:
-   archiveOnDelete: "false"
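Apply it the same way as the other manifests and confirm that the StorageClass exists:
- kubectl apply -f storage-class.yaml
- kubectl get storageclass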
Those are the key points of the NFS volume setup, but do you think it’s over…?
K8s bug caused by NFS container
When using the nfs-client-provisioner container, I found that it kept reporting this error:
- unexpected error getting claim reference: selfLink was empty, can't make reference
This error will cause the NFS volume to not be created correctly, which in turn will cause the business to not be deployed normally.
It finally turned out to be caused by a change in k8s 1.20.x and 1.21.x (selfLink was deprecated) 3:
- It looks like newer Kubernetes (1.20 / 1.21) have deprecated selflinks and this mandates a code change in NFS provisioners. See this issue for details: https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner/issues/25.
- I was able to work around this for now using the - --feature-gates=RemoveSelfLink=false work around mentioned there, but this isn't a long term solution.
Solution
In a nutshell: Add the - --feature-gates=RemoveSelfLink=false flag to /etc/kubernetes/manifests/kube-apiserver.yaml.
Details:
The changes involved are as follows 4 (only needed on the apiserver node, i.e. the master node):
- [root@env]# cat /etc/kubernetes/manifests/kube-apiserver.yaml
- apiVersion: v1
- kind: Pod
- metadata:
-   labels:
-     component: kube-apiserver
-     tier: control-plane
-   name: kube-apiserver
-   namespace: kube-system
- spec:
-   containers:
-   - command:
-     - kube-apiserver
-     …………
-     - --feature-gates=RemoveSelfLink=false
-     image: registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.21.2
-     …………
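Because kube-apiserver runs as a static pod, the kubelet restarts it automatically once the manifest is saved. One way to confirm the flag took effect (the component label is the one kubeadm sets, as shown in the manifest above):
- kubectl -n kube-system get pod -l component=kube-apiserver -o yaml | grep feature-gates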
The difference between volumeClaimTemplates and persistentVolumeClaim
In fact, both volumeClaimTemplates and persistentVolumeClaim produce PVCs, but there is a difference between the two. To sum it up in one sentence: the PVC names generated by volumeClaimTemplates automatically get the Pod name appended, while persistentVolumeClaim changes nothing: what you write is what you get.
Let’s look at a specific example.
volumeClaimTemplates
The usage of volumeClaimTemplates:
- kind: StatefulSet
- apiVersion: apps/v1
- metadata:
-   labels:
-     app: rabbitmq-cluster
-   name: rabbitmq-cluster
-   namespace: ns-public-rabbitmq
- spec:
-   replicas: 3
-   selector:
-     matchLabels:
-       app: rabbitmq-cluster
-   serviceName: rabbitmq-cluster
-   template:
-     metadata:
-       labels:
-         app: rabbitmq-cluster
-     spec:
-       containers:
-         - args:
-           ………………
-           volumeMounts:
-             - mountPath: /var/lib/rabbitmq
-               name: rabbitmq-storage
-               readOnly: false
-   volumeClaimTemplates:
-     - metadata:
-         name: rabbitmq-storage
-       spec:
-         accessModes:
-           - ReadWriteMany
-         storageClassName: "rabbitmq-nfs-storage"
-         resources:
-           requests:
-             storage: 2Gi
Look at the pvc created by volumeClaimTemplates:
- [root@platform01v ~]# kubectl -n ns-public-rabbitmq get pvc
- NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
- rabbitmq-storage-rabbitmq-cluster-0 Bound pvc-b75e6346-bee5-49fb-8881-308d975ae135 2Gi RWX rabbitmq-nfs-storage 40h
- rabbitmq-storage-rabbitmq-cluster-1 Bound pvc-4c8fa6a6-2818-41f4-891a-f2f389341f51 2Gi RWX rabbitmq-nfs-storage 40h
- rabbitmq-storage-rabbitmq-cluster-2 Bound pvc-941893ea-d600-40ac-a97f-ce9099ba9bbb 2Gi RWX rabbitmq-nfs-storage 40h
You can see that the NAME field format is:
[.spec.volumeClaimTemplates.metadata.name]-[PodName]
persistentVolumeClaim
Now let’s look at persistentVolumeClaim; first, how it is used:
- kind: StatefulSet
- apiVersion: apps/v1
- metadata:
-   labels:
-     app: workplatform-cluster
-   name: workplatform-cluster
-   namespace: ns-analyzer-workplatform
- spec:
-   replicas: 3
-   selector:
-     matchLabels:
-       app: workplatform-cluster
-   serviceName: workplatform-cluster
-   template:
-     metadata:
-       labels:
-         app: workplatform-cluster
-     spec:
-       containers:
-         - args:
-           ………………
-           volumeMounts:
-             - mountPath: /exports/apps
-               name: workplatform-nfs-storage
-               readOnly: false
-       volumes:
-         - name: workplatform-nfs-storage
-           persistentVolumeClaim:
-             claimName: workplatform-nfs-pvc
Look at the PVC created by persistentVolumeClaim:
- [root@env work-platform]# kubectl -n ns-analyzer-workplatform get pvc
- NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
- workplatform-nfs-pvc Bound workplatform-nfs-pv 5Gi RWX nfs-workplatform-storage 14h
You can see that the NAME is not modified in any way; it is exactly what was specified:
[.spec.volumes.persistentVolumeClaim.claimName]
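Incidentally, when using persistentVolumeClaim like this, the claim itself has to be created separately beforehand. Below is a minimal sketch of what that PVC could look like, with the name, namespace, size and StorageClass taken from the kubectl output above; treat it purely as an illustration of the object's shape:
- apiVersion: v1
- kind: PersistentVolumeClaim
- metadata:
-   name: workplatform-nfs-pvc
-   namespace: ns-analyzer-workplatform
- spec:
-   accessModes:
-     - ReadWriteMany
-   storageClassName: nfs-workplatform-storage
-   resources:
-     requests:
-       storage: 5Gi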
This naming difference also changes which directory a container ends up mounting: a cluster using volumeClaimTemplates gets one folder per Pod, created and mounted by the nfs-client-provisioner container, while persistentVolumeClaim mounts the same folder for all Pods:

Flannel deployment unsuccessful
After running kubectl apply -f kube-flannel.yml, I found the pods under the kube-flannel namespace stuck in an abnormal state:
- # kubectl -n kube-flannel get all
- NAME READY STATUS RESTARTS AGE
- kube-flannel-ds-22sb4 0/1 Init:ImagePullBackOff 7 51m
Describing the DaemonSet shows the following error:
- # kubectl -n kube-flannel describe daemonset.apps/kube-flannel-ds
- Warning FailedCreate 111s (x17 over 7m19s) daemonset-controller Error creating: pods "kube-flannel-ds-" is forbidden: pods with system-node-critical priorityClass is not permitted in kube-flannel namespace
Problem Cause
In a nutshell: kube-flannel.yml specifies priorityClassName: system-node-critical, and my k8s does not allow that priority class outside the kube-system namespace (probably because I am running an older k8s version).
Details:
Checking the kube-flannel.yml file, the relevant part looks like this (partially omitted):
- apiVersion: apps/v1
- kind: DaemonSet
- metadata:
-   name: kube-flannel-ds
-   namespace: kube-flannel
-   labels:
-     tier: node
-     app: flannel
-     k8s-app: flannel
- spec:
-   template:
-     spec:
-       ……
-       priorityClassName: system-node-critical
-       ……
In my cluster this priority class is not permitted in the kube-flannel namespace, which is exactly what the error message above complains about.
Solution
In a nutshell: Delete the priorityClassName: system-node-critical constraint, or create a custom class.
Details:
The simplest fix is to delete that line from kube-flannel.yml. Alternatively, you can create a custom PriorityClass and then point the yml at it:
- # cat flannel-priority-class.yml
- apiVersion: scheduling.k8s.io/v1
- description: Used for flannel critical pods that must not be moved from their current node.
- kind: PriorityClass
- metadata:
-   name: flannel-node-critical
- preemptionPolicy: PreemptLowerPriority
- value: 1000000
- globalDefault: false
- # kubectl apply -f flannel-priority-class.yml
Then change the priorityClassName in kube-flannel.yml to flannel-node-critical and reapply.
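A quick way to make that substitution and redeploy, assuming you keep a backup of the original file:
- sed -i 's/system-node-critical/flannel-node-critical/' kube-flannel.yml
- kubectl apply -f kube-flannel.yml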