
This post records some of the problems I ran into while learning and deploying Kubernetes. I am not a professional DevOps engineer, so mistakes are possible; when using this as a reference, please check whether it actually solves your problem, and don't let my imprecise wording waste more of your time.
kubeadm init fails
When deploying a cluster with kubeadm, init can fail with the following error:
- [root@env ~]# kubeadm init
- [init] Using Kubernetes version: v1.24.1
- [preflight] Running pre-flight checks
- error execution phase preflight: [preflight] Some fatal errors occurred:
- [ERROR CRI]: container runtime is not running: output: time="2023-05-27T12:19:30Z" level=fatal msg="validate service connection: CRI v1 runtime API is not implemented for endpoint \"unix:///var/run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"
- , error: exit status 1
- [preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
- To see the stack trace of this error execute with --v=5 or higher
Cause
In one sentence: containerd is too old, or its cri plugin has been disabled.
Details:
There are two likely causes. The first is an outdated containerd, which can be fixed by upgrading containerd.
If you have confirmed that you are running the latest version (or something close to it), then the default configuration file has probably disabled the cri plugin:
- root@Debian-11-00-x64:~# cat /etc/containerd/config.toml
- ......
- disabled_plugins = ["cri"]
- ......
Solution
In one sentence: re-enable the cri plugin.
Details:
Edit /etc/containerd/config.toml, remove "cri" from the disabled_plugins list, and restart containerd.
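As a concrete sketch (assuming containerd runs as a systemd service and you are happy with the stock configuration, which does not disable the cri plugin), the following commands regenerate a default config and restart the runtime:
- # regenerate a default config, then restart containerd
- containerd config default > /etc/containerd/config.toml
- systemctl restart containerd
- # the CRI endpoint should now answer; kubeadm's preflight check talks to the same socket
- crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock info
Note that regenerating the default config discards any customizations, so editing disabled_plugins by hand is the safer route on a tuned host.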
The coredns problem
During cluster bootstrap, kubeadm creates the core services Kubernetes needs in the kube-system namespace. On my 1.21.2 cluster, however, the coredns pods never came up: the STATUS of pod/coredns stayed at ImagePullBackOff or ErrImagePull.
The consequence is that nodes in the cluster cannot resolve in-cluster "domain names", so any access by name fails. If your workloads use headless services or other in-cluster domain names, make sure this service is healthy.
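A quick way to check whether in-cluster DNS works at all is the standard test from the Kubernetes DNS-debugging docs: start a throwaway pod and resolve the kubernetes service (the pod name is arbitrary):
- kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default
- # a healthy cluster returns the ClusterIP of the kubernetes service;
- # with coredns down, the lookup times out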
The environment where I hit this is gone, so here is a screenshot from someone else [1]:

Cause
In one sentence: the mirror registry does not contain the coredns image.
Details:
For well-known reasons, most people in China deploy Kubernetes from a domestic mirror, and many of us use this one:
- registry.aliyuncs.com
However, this mirror does not have the coredns image:
- [root@env ~]# kubectl -n kube-system describe pod/coredns-6f6b8cc4f6-9c67s
- Name: coredns-6f6b8cc4f6-9c67s
- Namespace: kube-system
- …………
- Containers:
- coredns:
- Container ID: docker://e655081d120efe6a69e564e20edfa88c0e0ad2a29cb11f35870fe2cef1057fb0
- Image: registry.aliyuncs.com/google_containers/coredns:v1.8.0
- Image ID: docker-pullable://coredns/coredns@sha256:cc8fb77bc2a0541949d1d9320a641b82fd392b0d3d8145469ca4709ae769980e
- …………
- [root@env ~]# docker pull registry.aliyuncs.com/google_containers/coredns:v1.8.0
- Error response from daemon: manifest for registry.aliyuncs.com/google_containers/coredns:v1.8.0 not found: manifest unknown: manifest unknown
Solution
In one sentence: pull the coredns image from the official registry, then re-tag it locally with the name the pod expects.
Details:
First, pull the coredns image from the official registry:
- docker pull coredns/coredns:1.8.0
Then tag it locally with the image name the pod spec expects (matching the Image field in the describe output above):
- docker tag coredns/coredns:1.8.0 registry.aliyuncs.com/google_containers/coredns:v1.8.0
In principle this has to be done on every node in the cluster, because the coredns pods may be scheduled onto any of them.
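If you have more than a couple of nodes, something like the loop below saves some typing. This is a hedged sketch: the node names are placeholders and passwordless SSH is assumed.
- # node1..node3 are placeholders for your actual node hostnames
- for node in node1 node2 node3; do
-   ssh "$node" "docker pull coredns/coredns:1.8.0 && \
-     docker tag coredns/coredns:1.8.0 registry.aliyuncs.com/google_containers/coredns:v1.8.0"
- done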
NFS persistent volume "unbound" error
NFS is one of the most beginner-friendly, lowest-cost storage options for Kubernetes. Yet when I deployed workloads using NFS volumes, the pods often sat in Pending forever; kubectl describe showed an error about unbound PersistentVolumeClaims:

Cause
In one sentence: nobody is answering the PVC request, so the claim is never turned into an actual storage area on disk.
Details:
In Kubernetes, every request for persistent storage is expressed as a PVC. But once the request has been made, who fulfills it? There are two options: provision the volumes by hand, or run a provisioner pod that fulfills PVC requests automatically.
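For completeness, the "by hand" route means creating a PersistentVolume yourself that matches the claim. A minimal sketch, assuming an NFS export like the one used later in this article (the PV name and size are illustrative):
- kubectl apply -f - <<EOF
- apiVersion: v1
- kind: PersistentVolume
- metadata:
-   name: manual-nfs-pv        # hypothetical name
- spec:
-   capacity:
-     storage: 2Gi
-   accessModes:
-   - ReadWriteMany
-   persistentVolumeReclaimPolicy: Retain
-   nfs:
-     server: 192.168.50.42    # example NFS server, reused from the deployment below
-     path: /volume1/nfs-kafka
- EOF
The rest of this section focuses on the second, automatic route.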
Solution
Most tutorials you will find online use the following container to handle NFS PVC requests:
- quay.io/external_storage/nfs-client-provisioner:latest
A good article walks through the whole deployment [2]; here are the key points.
Because the nfs-client-provisioner pod needs to read and modify some cluster state, it must be granted permissions first. The rbac.yaml file looks like this:
- apiVersion: v1
- kind: ServiceAccount
- metadata:
-   name: nfs-client-provisioner
-   # replace with namespace where provisioner is deployed
-   namespace: kafka-test
- ---
- kind: ClusterRole
- apiVersion: rbac.authorization.k8s.io/v1
- metadata:
-   name: nfs-client-provisioner-runner
- rules:
- - apiGroups: [""]
-   resources: ["persistentvolumes"]
-   verbs: ["get", "list", "watch", "create", "delete"]
- - apiGroups: [""]
-   resources: ["persistentvolumeclaims"]
-   verbs: ["get", "list", "watch", "update"]
- - apiGroups: ["storage.k8s.io"]
-   resources: ["storageclasses"]
-   verbs: ["get", "list", "watch"]
- - apiGroups: [""]
-   resources: ["events"]
-   verbs: ["create", "update", "patch"]
- ---
- kind: ClusterRoleBinding
- apiVersion: rbac.authorization.k8s.io/v1
- metadata:
-   name: run-nfs-client-provisioner
- subjects:
- - kind: ServiceAccount
-   name: nfs-client-provisioner
-   # replace with namespace where provisioner is deployed
-   namespace: kafka-test
- roleRef:
-   kind: ClusterRole
-   name: nfs-client-provisioner-runner
-   apiGroup: rbac.authorization.k8s.io
- ---
- kind: Role
- apiVersion: rbac.authorization.k8s.io/v1
- metadata:
-   name: leader-locking-nfs-client-provisioner
-   # replace with namespace where provisioner is deployed
-   namespace: kafka-test
- rules:
- - apiGroups: [""]
-   resources: ["endpoints"]
-   verbs: ["get", "list", "watch", "create", "update", "patch"]
- ---
- kind: RoleBinding
- apiVersion: rbac.authorization.k8s.io/v1
- metadata:
-   name: leader-locking-nfs-client-provisioner
-   # replace with namespace where provisioner is deployed
-   namespace: kafka-test
- subjects:
- - kind: ServiceAccount
-   name: nfs-client-provisioner
-   # replace with namespace where provisioner is deployed
-   namespace: kafka-test
- roleRef:
-   kind: Role
-   name: leader-locking-nfs-client-provisioner
-   apiGroup: rbac.authorization.k8s.io
Apply the permissions with:
- kubectl apply -f rbac.yaml
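Optionally, a quick sanity check that the objects were created (namespace as in rbac.yaml above):
- kubectl -n kafka-test get serviceaccount nfs-client-provisioner
- kubectl get clusterrole nfs-client-provisioner-runner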
Next, create the nfs-client-provisioner pod. First, the deployment.yaml file:
- apiVersion: apps/v1
- kind: Deployment
- metadata:
-   name: nfs-client-provisioner
-   labels:
-     app: nfs-client-provisioner
-   # replace with namespace where provisioner is deployed
-   namespace: kafka-test
- spec:
-   replicas: 1
-   strategy:
-     type: Recreate
-   selector:
-     matchLabels:
-       app: nfs-client-provisioner
-   template:
-     metadata:
-       labels:
-         app: nfs-client-provisioner
-     spec:
-       serviceAccountName: nfs-client-provisioner
-       containers:
-       - name: nfs-client-provisioner
-         image: quay.io/external_storage/nfs-client-provisioner:latest
-         volumeMounts:
-         - name: nfs-client-root
-           mountPath: /persistentvolumes
-         env:
-         - name: PROVISIONER_NAME
-           value: fuseim.pri/ifs
-         - name: NFS_SERVER
-           value: 192.168.50.42
-         - name: NFS_PATH
-           value: /volume1/nfs-kafka
-       volumes:
-       - name: nfs-client-root
-         nfs:
-           server: 192.168.50.42
-           path: /volume1/nfs-kafka
Create it with:
- kubectl apply -f deployment.yaml
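Before moving on, it is worth confirming that the provisioner pod actually came up (the namespace follows the manifests above):
- kubectl -n kafka-test get pods -l app=nfs-client-provisioner
- kubectl -n kafka-test logs deploy/nfs-client-provisioner --tail=20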
Finally, create the corresponding NFS StorageClass. The storage-class.yaml file:
- apiVersion: storage.k8s.io/v1
- kind: StorageClass
- metadata:
-   name: managed-nfs-storage   # must match the storageClassName used in your workloads' volumeClaimTemplates
- provisioner: fuseim.pri/ifs   # must match the PROVISIONER_NAME env var in the provisioner's deployment.yaml
- parameters:
-   archiveOnDelete: "false"
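Once the StorageClass is applied, a quick end-to-end test is to create a throwaway PVC against it and watch it bind. This is a hedged sketch and the PVC name is made up:
- kubectl apply -f - <<EOF
- apiVersion: v1
- kind: PersistentVolumeClaim
- metadata:
-   name: nfs-test-pvc          # hypothetical name
-   namespace: kafka-test
- spec:
-   accessModes:
-   - ReadWriteMany
-   storageClassName: managed-nfs-storage
-   resources:
-     requests:
-       storage: 1Gi
- EOF
- kubectl -n kafka-test get pvc nfs-test-pvc -w   # should move from Pending to Bound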
Those are the key points for NFS volumes. But did you think that was the end of it...?
A Kubernetes bug surfaced by the NFS provisioner
While running the nfs-client-provisioner container, it kept logging this error:
- unexpected error getting claim reference: selfLink was empty, can't make reference
This error prevents NFS volumes from being created correctly, which in turn blocks the workloads from deploying.
It turned out to be caused by a behavior change in Kubernetes 1.20.x and 1.21.x [3]:
- It looks like newer Kubernetes (1.20 / 1.21) have deprecated selflinks and this mandates a code change in NFS provisioners. See this issue for details: https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner/issues/25.
- I was able to work around this for now using the - --feature-gates=RemoveSelfLink=false work around mentioned there, but this isn't a long term solution.
Solution
In one sentence: add the --feature-gates=RemoveSelfLink=false flag to /etc/kubernetes/manifests/kube-apiserver.yaml.
Details:
The change looks like this [4] (it only needs to be made on the apiserver nodes, i.e. the master nodes):
- [root@env]# cat /etc/kubernetes/manifests/kube-apiserver.yaml
- apiVersion: v1
- kind: Pod
- metadata:
-   labels:
-     component: kube-apiserver
-     tier: control-plane
-   name: kube-apiserver
-   namespace: kube-system
- spec:
-   containers:
-   - command:
-     - kube-apiserver
-     …………
-     - --feature-gates=RemoveSelfLink=false
-     image: registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.21.2
-   …………
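The kubelet restarts the static apiserver pod automatically once the manifest is saved. A couple of hedged checks to confirm the flag took effect and the provisioner recovered (namespace as in the earlier deployment):
- kubectl -n kube-system get pods -l component=kube-apiserver
- kubectl -n kube-system describe pod -l component=kube-apiserver | grep RemoveSelfLink
- kubectl -n kafka-test logs deploy/nfs-client-provisioner --tail=20   # the selfLink errors should stop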
The difference between volumeClaimTemplates and persistentVolumeClaim
Both volumeClaimTemplates and persistentVolumeClaim end up working through PVCs, but they differ. In one sentence: the PVCs created from volumeClaimTemplates automatically get the Pod name (StatefulSet name plus ordinal) appended, while with persistentVolumeClaim the PVC name is exactly whatever you wrote in claimName.
Let's look at concrete examples.
volumeClaimTemplates
How volumeClaimTemplates is used:
- kind: StatefulSet
- apiVersion: apps/v1
- metadata:
-   labels:
-     app: rabbitmq-cluster
-   name: rabbitmq-cluster
-   namespace: ns-public-rabbitmq
- spec:
-   replicas: 3
-   selector:
-     matchLabels:
-       app: rabbitmq-cluster
-   serviceName: rabbitmq-cluster
-   template:
-     metadata:
-       labels:
-         app: rabbitmq-cluster
-     spec:
-       containers:
-       - args:
-           ………………
-         volumeMounts:
-         - mountPath: /var/lib/rabbitmq
-           name: rabbitmq-storage
-           readOnly: false
-   volumeClaimTemplates:
-   - metadata:
-       name: rabbitmq-storage
-     spec:
-       accessModes:
-       - ReadWriteMany
-       storageClassName: "rabbitmq-nfs-storage"
-       resources:
-         requests:
-           storage: 2Gi
Here are the PVCs created by volumeClaimTemplates:
- [root@platform01v ~]# kubectl -n ns-public-rabbitmq get pvc
- NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
- rabbitmq-storage-rabbitmq-cluster-0 Bound pvc-b75e6346-bee5-49fb-8881-308d975ae135 2Gi RWX rabbitmq-nfs-storage 40h
- rabbitmq-storage-rabbitmq-cluster-1 Bound pvc-4c8fa6a6-2818-41f4-891a-f2f389341f51 2Gi RWX rabbitmq-nfs-storage 40h
- rabbitmq-storage-rabbitmq-cluster-2 Bound pvc-941893ea-d600-40ac-a97f-ce9099ba9bbb 2Gi RWX rabbitmq-nfs-storage 40h
As you can see, the NAME field follows the pattern
[.spec.volumeClaimTemplates.metadata.name]-[PodName]
For example, the template rabbitmq-storage combined with Pod rabbitmq-cluster-0 yields rabbitmq-storage-rabbitmq-cluster-0.
persistentVolumeClaim
Now let's look at persistentVolumeClaim. First, how it is used:
- kind: StatefulSet
- apiVersion: apps/v1
- metadata:
-   labels:
-     app: workplatform-cluster
-   name: workplatform-cluster
-   namespace: ns-analyzer-workplatform
- spec:
-   replicas: 3
-   selector:
-     matchLabels:
-       app: workplatform-cluster
-   serviceName: workplatform-cluster
-   template:
-     metadata:
-       labels:
-         app: workplatform-cluster
-     spec:
-       containers:
-       - args:
-           ………………
-         volumeMounts:
-         - mountPath: /exports/apps
-           name: workplatform-nfs-storage
-           readOnly: false
-       volumes:
-       - name: workplatform-nfs-storage
-         persistentVolumeClaim:
-           claimName: workplatform-nfs-pvc
And the PVC referenced via persistentVolumeClaim:
- [root@env work-platform]# kubectl -n ns-analyzer-workplatform get pvc
- NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
- workplatform-nfs-pvc Bound workplatform-nfs-pv 5Gi RWX nfs-workplatform-storage 14h
Here NAME is not modified at all: it is exactly the specified
[.spec.volumes.persistentVolumeClaim.claimName]
This difference changes which directory each container ends up mounting: with volumeClaimTemplates, the nfs-client-provisioner creates and mounts a separate directory for every Pod, whereas with persistentVolumeClaim every Pod mounts the same directory:
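On the NFS server the difference is easy to see. The listing below is only illustrative (the export path follows this article's examples); nfs-client-provisioner names each per-PVC directory after the namespace, PVC name and PV name:
- # per-Pod directories created by the provisioner (volumeClaimTemplates case)
- ls /volume1/nfs-kafka
- # ns-public-rabbitmq-rabbitmq-storage-rabbitmq-cluster-0-pvc-b75e6346-...
- # ns-public-rabbitmq-rabbitmq-storage-rabbitmq-cluster-1-pvc-4c8fa6a6-...
- # ns-public-rabbitmq-rabbitmq-storage-rabbitmq-cluster-2-pvc-941893ea-...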

Flannel deployment fails
After running kubectl apply -f kube-flannel.yml, the pods in the kube-flannel namespace never reached a healthy state:
- # kubectl -n kube-flannel get all
- NAME READY STATUS RESTARTS AGE
- kube-flannel-ds-22sb4 0/1 Init:ImagePullBackOff 7 51m
Describing the DaemonSet shows the following error:
- # kubectl -n kube-flannel describe daemonset.apps/kube-flannel-ds
- Warning FailedCreate 111s (x17 over 7m19s) daemonset-controller Error creating: pods "kube-flannel-ds-" is forbidden: pods with system-node-critical priorityClass is not permitted in kube-flannel namespace
Cause
In one sentence: kube-flannel.yml asks for the system-node-critical priorityClassName, but for reasons I haven't pinned down my cluster refuses to allow it in the kube-flannel namespace (possibly because I am running an older Kubernetes version?).
Details:
Looking at kube-flannel.yml, the relevant part is as follows (portions omitted):
- apiVersion: apps/v1
- kind: DaemonSet
- metadata:
-   name: kube-flannel-ds
-   namespace: kube-flannel
-   labels:
-     tier: node
-     app: flannel
-     k8s-app: flannel
- spec:
-   template:
-     spec:
-       ……
-       priorityClassName: system-node-critical
-       ……
The cluster, however, does not permit pods with this priority class in the kube-flannel namespace, so the DaemonSet's pods are rejected.
Solution
In one sentence: remove the priorityClassName: system-node-critical constraint, or create a custom priority class.
Details:
The simplest fix is to delete the constraint from kube-flannel.yml. Alternatively, create a custom PriorityClass and point the yml at it:
- # cat flannel-priority-class.yml
- apiVersion: scheduling.k8s.io/v1
- description: Used for flannel critical pods that must not be moved from their current node.
- kind: PriorityClass
- metadata:
-   name: flannel-node-critical
- preemptionPolicy: PreemptLowerPriority
- value: 1000000
- globalDefault: false
- # kubectl apply -f flannel-priority-class.yml
Then change priorityClassName in kube-flannel.yml to flannel-node-critical and re-apply it.
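A minimal sketch of that swap-and-reapply step, assuming kube-flannel.yml is in the current directory:
- sed -i 's/system-node-critical/flannel-node-critical/' kube-flannel.yml
- kubectl apply -f kube-flannel.yml
- kubectl -n kube-flannel get pods -w   # the flannel pods should now be created and start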