
[NOTE] Problems and notes from learning Kubernetes

2021-07-07 @ UTC+0


This article records some of the problems the author ran into while learning and deploying Kubernetes (K8S). Since the author is not a professional DevOps engineer, there may be mistakes; while reading, please check whether each fix actually solves your problem, and don't let the author's unprofessional wording waste your time.

Kubeadm init failed

When deploying a cluster with kubeadm, the following error may occur during initialization:

  [root@env ~]# kubeadm init
  [init] Using Kubernetes version: v1.24.1
  [preflight] Running pre-flight checks
  error execution phase preflight: [preflight] Some fatal errors occurred:
          [ERROR CRI]: container runtime is not running: output: time="2023-05-27T12:19:30Z" level=fatal msg="validate service connection: CRI v1 runtime API is not implemented for endpoint \"unix:///var/run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"
  , error: exit status 1
  [preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
  To see the stack trace of this error execute with --v=5 or higher

Problem Cause

In a nutshell: The containerd version is too old, or the cri plugin has been disabled in containerd.
Details:

There are two likely causes. The first is an outdated containerd, which can be solved by upgrading containerd; see the version check below.
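
You can check the installed version first (for Kubernetes 1.24+, kubeadm talks to the CRI v1 API, which containerd only implements from the 1.6 series onward):

  containerd --version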

If the installed version is the latest, or close to it, the problem is probably that the default configuration file ships with the cri plugin disabled:

  root@Debian-11-00-x64:~# cat /etc/containerd/config.toml
  ......
  disabled_plugins = ["cri"]
  ......

Solution 

In a nutshell: Re-enable the disabled cri plugin.
Details:

Edit the configuration file /etc/containerd/config.toml and remove "cri" from disabled_plugins, then restart containerd.
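
A minimal sketch of the fix (the sed pattern assumes the exact disabled_plugins = ["cri"] line shown above; adjust it if your file differs):

  # Remove "cri" from the disabled plugin list
  sed -i 's/disabled_plugins = \["cri"\]/disabled_plugins = []/' /etc/containerd/config.toml
  # Restart containerd so the change takes effect
  systemctl restart containerd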

Coredns problem

When deploying a cluster with kubeadm, kubeadm creates the basic services k8s needs in the kube-system namespace. However, in the author's 1.21.2 cluster the coredns pod could not be created successfully: the STATUS field of pod/coredns stays at ImagePullBackOff or ErrImagePull.

This problem causes every node in the k8s cluster to fail to resolve in-cluster "domain names", which eventually leads to access failures. If your workload uses headless services or domain names inside the cluster, check whether this service is healthy.
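
A quick way to check whether in-cluster DNS works at all (a sketch; busybox:1.28 and the default kubernetes.default name are just the usual test choices):

  kubectl run dns-test --rm -it --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default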

The image that originally showed the problem for the author has been lost; reference [1] contains a similar screenshot.


Problem Cause

In a nutshell: There is no coredns image in the source.

Details: For well-known reasons, most people in China choose a domestic mirror when deploying a k8s cluster, and many use the following source:

  registry.aliyuncs.com

However, there is no coredns image in this source:

  [root@env ~]# kubectl -n kube-system describe pod/coredns-6f6b8cc4f6-9c67s
  Name:                 coredns-6f6b8cc4f6-9c67s
  Namespace:            kube-system
  …………
  Containers:
    coredns:
      Container ID:  docker://e655081d120efe6a69e564e20edfa88c0e0ad2a29cb11f35870fe2cef1057fb0
      Image:         registry.aliyuncs.com/google_containers/coredns:v1.8.0
      Image ID:      docker-pullable://coredns/coredns@sha256:cc8fb77bc2a0541949d1d9320a641b82fd392b0d3d8145469ca4709ae769980e
  …………

  [root@env ~]# docker pull registry.aliyuncs.com/google_containers/coredns:v1.8.0
  Error response from daemon: manifest for registry.aliyuncs.com/google_containers/coredns:v1.8.0 not found: manifest unknown: manifest unknown

Solution


In a nutshell: Download the coredns image from the official source, then tag it locally as the image name the cluster expects.

Details:

First get the coredns image from the official source:

  docker pull coredns/coredns:1.8.0

Then retag it as the image name the pod expects:

  docker tag coredns/coredns:1.8.0 registry.aliyuncs.com/google_containers/coredns:v1.8.0

In theory these operations need to be performed on every k8s node, because the coredns pod may be scheduled onto any of them; see the sketch below.
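
A loop like the following can push the fix to all nodes (a sketch; the node hostnames are hypothetical and password-less ssh is assumed):

  for node in node1 node2 node3; do
    ssh "$node" "docker pull coredns/coredns:1.8.0 && \
      docker tag coredns/coredns:1.8.0 registry.aliyuncs.com/google_containers/coredns:v1.8.0"
  done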

NFS persistent volume unbound error

NFS is definitely one of the most entry-level, lowest-cost storage solutions in k8s. However, when the author deployed containers using NFS volumes, the pods were often stuck in the Pending state. The original describe screenshot has been lost, but the typical message is "pod has unbound immediate PersistentVolumeClaims".


Problem Cause


In a nutshell: Nobody responds to the PVC request, so the claim never lands on actual storage on a disk.

Details:

In the k8s system, all requests for persistent volumes take the form of a PVC. But once a request is issued, who handles it? There are two approaches: handle the PVC manually by creating a matching PV yourself, or run a provisioner container that disposes of PVC requests automatically. A sketch of the manual path follows.
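
For illustration, a minimal hand-written PV that could satisfy a 2Gi ReadWriteMany NFS claim (the PV name is hypothetical; the server and path reuse values from this article's later examples):

  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: manual-nfs-pv
  spec:
    capacity:
      storage: 2Gi
    accessModes:
      - ReadWriteMany
    persistentVolumeReclaimPolicy: Retain
    nfs:
      server: 192.168.50.42
      path: /volume1/nfs-kafka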

Solution

Most tutorials found online use the following container to handle NFS PVC requests:

  quay.io/external_storage/nfs-client-provisioner:latest

Reference [2] is a good article describing the entire deployment process; here are the key points:

Since the nfs-client-provisioner container needs to read some state and attributes of the k8s cluster, it must be authorized first. The rbac.yaml file is as follows:

  apiVersion: v1
  kind: ServiceAccount
  metadata:
    name: nfs-client-provisioner
    # replace with namespace where provisioner is deployed
    namespace: kafka-test
  ---
  kind: ClusterRole
  apiVersion: rbac.authorization.k8s.io/v1
  metadata:
    name: nfs-client-provisioner-runner
  rules:
    - apiGroups: [""]
      resources: ["persistentvolumes"]
      verbs: ["get", "list", "watch", "create", "delete"]
    - apiGroups: [""]
      resources: ["persistentvolumeclaims"]
      verbs: ["get", "list", "watch", "update"]
    - apiGroups: ["storage.k8s.io"]
      resources: ["storageclasses"]
      verbs: ["get", "list", "watch"]
    - apiGroups: [""]
      resources: ["events"]
      verbs: ["create", "update", "patch"]
  ---
  kind: ClusterRoleBinding
  apiVersion: rbac.authorization.k8s.io/v1
  metadata:
    name: run-nfs-client-provisioner
  subjects:
    - kind: ServiceAccount
      name: nfs-client-provisioner
      # replace with namespace where provisioner is deployed
      namespace: kafka-test
  roleRef:
    kind: ClusterRole
    name: nfs-client-provisioner-runner
    apiGroup: rbac.authorization.k8s.io
  ---
  kind: Role
  apiVersion: rbac.authorization.k8s.io/v1
  metadata:
    name: leader-locking-nfs-client-provisioner
    # replace with namespace where provisioner is deployed
    namespace: kafka-test
  rules:
    - apiGroups: [""]
      resources: ["endpoints"]
      verbs: ["get", "list", "watch", "create", "update", "patch"]
  ---
  kind: RoleBinding
  apiVersion: rbac.authorization.k8s.io/v1
  metadata:
    name: leader-locking-nfs-client-provisioner
    # replace with namespace where provisioner is deployed
    namespace: kafka-test
  subjects:
    - kind: ServiceAccount
      name: nfs-client-provisioner
      # replace with namespace where provisioner is deployed
      namespace: kafka-test
  roleRef:
    kind: Role
    name: leader-locking-nfs-client-provisioner
    apiGroup: rbac.authorization.k8s.io

Then use the following command to create authorization:

  kubectl apply -f rbac.yaml
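
A quick sanity check afterwards (assuming the kafka-test namespace from the file above):

  kubectl -n kafka-test get serviceaccount nfs-client-provisioner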

Then create the nfs-client-provisioner Pod. First, the deployment.yaml file:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: nfs-client-provisioner
    labels:
      app: nfs-client-provisioner
    # replace with namespace where provisioner is deployed
    namespace: kafka-test
  spec:
    replicas: 1
    strategy:
      type: Recreate
    selector:
      matchLabels:
        app: nfs-client-provisioner
    template:
      metadata:
        labels:
          app: nfs-client-provisioner
      spec:
        serviceAccountName: nfs-client-provisioner
        containers:
          - name: nfs-client-provisioner
            image: quay.io/external_storage/nfs-client-provisioner:latest
            volumeMounts:
              - name: nfs-client-root
                mountPath: /persistentvolumes
            env:
              - name: PROVISIONER_NAME
                value: fuseim.pri/ifs
              - name: NFS_SERVER
                value: 192.168.50.42
              - name: NFS_PATH
                value: /volume1/nfs-kafka
        volumes:
          - name: nfs-client-root
            nfs:
              server: 192.168.50.42
              path: /volume1/nfs-kafka

Then apply it with:

  kubectl apply -f deployment.yaml
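
Once applied, the provisioner pod should come up Running (again assuming the kafka-test namespace):

  kubectl -n kafka-test get pods -l app=nfs-client-provisioner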

Finally, create the corresponding NFS StorageClass with the storage-class.yaml file:

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: managed-nfs-storage  # Note: this value must match the storageClassName of volumeClaimTemplates in the business container
  provisioner: fuseim.pri/ifs  # Note: this value must match the PROVISIONER_NAME env in the provisioner's deployment.yaml
  parameters:
    archiveOnDelete: "false"
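
Apply the class, then confirm the whole chain works with a throwaway claim. A minimal test PVC (the name test-nfs-pvc is hypothetical):

  kubectl apply -f storage-class.yaml

  # test-pvc.yaml
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: test-nfs-pvc
  spec:
    accessModes:
      - ReadWriteMany
    storageClassName: managed-nfs-storage
    resources:
      requests:
        storage: 1Gi

If everything is wired up correctly, kubectl get pvc test-nfs-pvc should report Bound within a few seconds.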

That's all for the key points of NFS volumes, but do you think it's over…?

K8s bug caused by NFS container

When using the nfs-client-provisioner container, the author found that it kept reporting this error:

  unexpected error getting claim reference: selfLink was empty, can't make reference

This error prevents NFS volumes from being created correctly, which in turn prevents workloads from being deployed normally.

It finally turned out to be caused by a behavior change in k8s 1.20.x and 1.21.x [3]:

  It looks like newer Kubernetes (1.20 / 1.21) have deprecated selflinks and this mandates a code change in NFS provisioners. See this issue for details: https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner/issues/25.

  I was able to work around this for now using the - --feature-gates=RemoveSelfLink=false work around mentioned there, but this isn't a long term solution.

Solution

In a nutshell: Add the - --feature-gates=RemoveSelfLink=false flag in /etc/kubernetes/manifests/kube-apiserver.yaml.

Details:

The change involved is as follows [4] (it only needs to be made on the apiserver node, i.e. the master node):

  [root@env]# cat /etc/kubernetes/manifests/kube-apiserver.yaml
  apiVersion: v1
  kind: Pod
  metadata:
    labels:
      component: kube-apiserver
      tier: control-plane
    name: kube-apiserver
    namespace: kube-system
  spec:
    containers:
    - command:
      - kube-apiserver
      …………
      - --feature-gates=RemoveSelfLink=false
      image: registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.21.2
      …………
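
kube-apiserver is a static pod, so the kubelet restarts it automatically once the manifest changes; you can confirm it came back with the new flag (the grep is just a convenience):

  kubectl -n kube-system get pods | grep kube-apiserver
  ps aux | grep [k]ube-apiserver | grep -o 'feature-gates=[^ ]*'

Note that the RemoveSelfLink feature gate is locked on from Kubernetes 1.24, so this workaround only applies to 1.20–1.23; newer provisioner images (nfs-subdir-external-provisioner) fix the issue properly.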

The difference between volumeClaimTemplates and persistentVolumeClaim

In fact, both volumeClaimTemplates and persistentVolumeClaim produce a PVC, but there is a difference between the two. In one sentence: the PVC names generated by volumeClaimTemplates have the pod name appended automatically, while persistentVolumeClaim does not touch the name at all; with persistentVolumeClaim, what you write is what you get.

Let’s look at a specific example.

volumeClaimTemplates


The usage of volumeClaimTemplates:

  kind: StatefulSet
  apiVersion: apps/v1
  metadata:
    labels:
      app: rabbitmq-cluster
    name: rabbitmq-cluster
    namespace: ns-public-rabbitmq
  spec:
    replicas: 3
    selector:
      matchLabels:
        app: rabbitmq-cluster
    serviceName: rabbitmq-cluster
    template:
      metadata:
        labels:
          app: rabbitmq-cluster
      spec:
        containers:
        - args:
          ………………
          volumeMounts:
          - mountPath: /var/lib/rabbitmq
            name: rabbitmq-storage
            readOnly: false
    volumeClaimTemplates:
    - metadata:
        name: rabbitmq-storage
      spec:
        accessModes:
        - ReadWriteMany
        storageClassName: "rabbitmq-nfs-storage"
        resources:
          requests:
            storage: 2Gi

Look at the PVCs created by volumeClaimTemplates:

  [root@platform01v ~]# kubectl -n ns-public-rabbitmq get pvc
  NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS           AGE
  rabbitmq-storage-rabbitmq-cluster-0   Bound    pvc-b75e6346-bee5-49fb-8881-308d975ae135   2Gi        RWX            rabbitmq-nfs-storage   40h
  rabbitmq-storage-rabbitmq-cluster-1   Bound    pvc-4c8fa6a6-2818-41f4-891a-f2f389341f51   2Gi        RWX            rabbitmq-nfs-storage   40h
  rabbitmq-storage-rabbitmq-cluster-2   Bound    pvc-941893ea-d600-40ac-a97f-ce9099ba9bbb   2Gi        RWX            rabbitmq-nfs-storage   40h

You can see that the NAME field format is:

[.spec.volumeClaimTemplates.metadata.name]-[PodName]

persistentVolumeClaim


Now let's look at persistentVolumeClaim. First, how it is used:

  kind: StatefulSet
  apiVersion: apps/v1
  metadata:
    labels:
      app: workplatform-cluster
    name: workplatform-cluster
    namespace: ns-analyzer-workplatform
  spec:
    replicas: 3
    selector:
      matchLabels:
        app: workplatform-cluster
    serviceName: workplatform-cluster
    template:
      metadata:
        labels:
          app: workplatform-cluster
      spec:
        containers:
        - args:
          ………………
          volumeMounts:
          - mountPath: /exports/apps
            name: workplatform-nfs-storage
            readOnly: false
        volumes:
          - name: workplatform-nfs-storage
            persistentVolumeClaim:
              claimName: workplatform-nfs-pvc

Look at the PVC created by persistentVolumeClaim:

  [root@env work-platform]# kubectl -n ns-analyzer-workplatform get pvc
  NAME                   STATUS   VOLUME                CAPACITY   ACCESS MODES   STORAGECLASS               AGE
  workplatform-nfs-pvc   Bound    workplatform-nfs-pv   5Gi        RWX            nfs-workplatform-storage   14h

You can see that the NAME is not modified in any way; it is exactly the claim name that was specified:

[.spec.volumes.persistentVolumeClaim.claimName]
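
Note that with persistentVolumeClaim the claim is not created for you; it must already exist in the namespace. A minimal sketch matching the names in the output above (the spec values are inferred from that output):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: workplatform-nfs-pvc
    namespace: ns-analyzer-workplatform
  spec:
    accessModes:
      - ReadWriteMany
    storageClassName: nfs-workplatform-storage
    resources:
      requests:
        storage: 5Gi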

This difference also changes what each container mounts: with volumeClaimTemplates, the nfs-client-provisioner creates and mounts a separate folder for every Pod, while with persistentVolumeClaim all Pods mount the same folder.


Flannel deployment unsuccessful

After running kubectl apply -f kube-flannel.yml, the containers in the kube-flannel namespace stayed in an abnormal state:

  # kubectl -n kube-flannel get all
  NAME                             READY   STATUS                  RESTARTS   AGE
  kube-flannel-ds-22sb4            0/1     Init:ImagePullBackOff   7          51m

Inspecting the DaemonSet with the describe command shows the following error:

  # kubectl -n kube-flannel describe daemonset.apps/kube-flannel-ds
    Warning  FailedCreate  111s (x17 over 7m19s)  daemonset-controller  Error creating: pods "kube-flannel-ds-" is forbidden: pods with system-node-critical priorityClass is not permitted in kube-flannel namespace

Problem Cause

In a nutshell: kube-flannel.yml sets priorityClassName: system-node-critical, and the current cluster does not permit that class in the kube-flannel namespace (possibly because the author is running an older version of k8s, where the system priority classes are only allowed in kube-system).

Details:
Checking the kube-flannel.yml file, the relevant description is as follows (partially omitted):

  apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    name: kube-flannel-ds
    namespace: kube-flannel
    labels:
      tier: node
      app: flannel
      k8s-app: flannel
  spec:
    template:
      spec:
        ……
        priorityClassName: system-node-critical
        ……

Since the cluster refuses this class in the kube-flannel namespace, the DaemonSet controller cannot create the pods.

Solution

In a nutshell: Delete the priorityClassName: system-node-critical constraint, or create a custom priority class.

Details:
The simplest method is to delete the relevant line from kube-flannel.yml. Alternatively, create a custom priority class and then point the yml at it:

  # cat flannel-priority-class.yml
  apiVersion: scheduling.k8s.io/v1
  description: Used for flannel critical pods that must not be moved from their current node.
  kind: PriorityClass
  metadata:
    name: flannel-node-critical
  preemptionPolicy: PreemptLowerPriority
  value: 1000000
  globalDefault: false

  # kubectl apply -f flannel-priority-class.yml

Then change the priorityClassName in kube-flannel.yml to flannel-node-critical and re-apply, as sketched below.
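
A two-line sketch of that change (assumes the class name above and that kube-flannel.yml is in the current directory):

  sed -i 's/system-node-critical/flannel-node-critical/' kube-flannel.yml
  kubectl apply -f kube-flannel.yml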

Reference

  [1] Solution to the problem of ImagePullBackOff and ErrImagePull of coredns in k8s
  [2] K8S's StorageClass in practice (NFS)
  [3] unexpected error getting claim reference: selfLink was empty, can't make reference
  [4] Configuration of NFS dynamic storage in k8s 1.20.x
