K8s 集群高可用master节点ETCD全部挂掉如何恢复?

写在前面

博文内容涉及集群 ETCD 全部挂掉，通过备份文件恢复的操作 Demo
理解不足小伙伴帮忙指正 😃,生活加油

不必太纠结于当下，也不必太忧虑未来，当你经历过一些事情的时候，眼前的风景已经和从前不一样了。——村上春树

前提是需要etcd备份文件，如果没有 etcd 备份，或者其他的备份手段，可能 GG 了

这里默认需要使用 etcdctl 的地方已经安装了该工具

备份文件分享

分享一个备份脚本

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$cat /usr/lib/systemd/system/etcd_back.sh
#!/bin/bash

#@File    :   erct_break.sh
#@Time    :   2023/01/27 23:00:27
#@Author  :   Li Ruilong
#@Version :   1.0
#@Desc    :   ETCD 备份
#@Contact :   1224965096@qq.com

if [ ! -d /root/back/ ];then
   mkdir -p /root/back/
fi
STR_DATE=$(date +%Y%m%d%H%M)

ETCDCTL_API=3 etcdctl \
--endpoints="https://127.0.0.1:2379"  \
--cert="/etc/kubernetes/pki/etcd/server.crt"  \
--key="/etc/kubernetes/pki/etcd/server.key"  \
--cacert="/etc/kubernetes/pki/etcd/ca.crt"   \
snapshot save /root/back/snap-${STR_DATE}.db

ETCDCTL_API=3 etcdctl --write-out=table snapshot status /root/back/snap-${STR_DATE}.db

sudo chmod  o-w,u-w,g-w  /root/back/snap-${STR_DATE}.db

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

运行方式

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh
Snapshot saved at /root/back/snap-202406051145.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 7b00ddcf | 22243784 |       5999 |      88 MB |
+----------+----------+------------+------------+

生成对应的备份数据

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ll /root/back/snap-202*
.....
-r--r--r-- 1 root root 87515168 6月   5 11:45 /root/back/snap-202406051144.db
-r--r--r-- 1 root root 87515168 6月   5 11:45 /root/back/snap-202406051145.db

可以使用 systemd 配置成 service unit

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl cat etcd-backup.service
# /usr/lib/systemd/system/etcd-backup.service
# /usr/lib/systemd/system/etcd-backup.service
[Unit]
Description= "ETCD 备份"
After=network-online.target

[Service]
Type=oneshot
Environment=ETCDCTL_API=3
ExecStart=/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh


[Install]
WantedBy=multi-user.target

主要是方便看日志，方便管理

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u etcd-backup.service
-- No entries --
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl start  etcd-backup.service
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u etcd-backup.service
-- Logs begin at 三 2024-06-05 03:49:25 CST, end at 三 2024-06-05 11:49:08 CST. --
6月 05 11:49:04 vms100.liruilongs.github.io systemd[1]: Starting "ETCD 备份"...
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: Snapshot saved at /root/back/snap-202406051149.db
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: | 1ce12bf7 | 22244346 |       3753 |      88 MB |
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io sudo[4344]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/chmod o-w,u-w,g-w /root/back/snap-202406051149.db
6月 05 11:49:07 vms100.liruilongs.github.io systemd[1]: Started "ETCD 备份".

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ll /root/back/snap-202406051*
........................
-r--r--r-- 1 root root 87515168 6月   5 11:49 /root/back/snap-202406051149.db
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

然后使用 timer unit 配置为定时启动

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl cat etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
[Unit]
Description="每天备份一次 ETCD"

[Timer]
OnBootSec=3s
OnCalendar=*-*-* 00:00:00
Unit=etcd-backup.service

[Install]
WantedBy=multi-user.target

同样可以看日志

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u  etcd-backup.timer
-- No entries --

故障处理恢复

故障表象，集群整个崩了，所有 master 上的 etcd 和 apiserver 都死掉了

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get  pods
The connection to the server 192.168.26.99:30033 was refused - did you specify the right host or port?

移动 etcd 和 apiserver 的对应静态 pod 的 yaml 文件。关于静态 Pod 运行原理这里不多讲，感兴趣小伙伴可以官网看下

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m  shell -a "mv  /etc/kubernetes/manifests/{etcd.yaml,kube-apiserver.yaml}  /tmp/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

清除当前集群的 etcd 的数据文件和对应的目录

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "rm -rf /var/lib/etcd/*" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'.  If you need to use command because file is insufficient you can add 'warn: false' to this command task or set
'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "ls /var/lib/etcd/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

拷贝备份文件到当前集群的每个 etcd 节点

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m  copy -a "src=/root/back/snap-202403270000.db dest=/root/" -i host.yaml
192.168.26.100 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.87-95740-233443993764822/source",
    "state": "file",
    "uid": 0
}
192.168.26.101 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.92-95742-263013169057776/source",
    "state": "file",
    "uid": 0
}
192.168.26.102 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.92-95744-205050494494041/source",
    "state": "file",
    "uid": 0
}
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

确定拷贝文件的备份文件

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "ls /root/snap-202403270000.db" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>
/root/snap-202403270000.db
192.168.26.101 | CHANGED | rc=0 >>
/root/snap-202403270000.db
192.168.26.100 | CHANGED | rc=0 >>
/root/snap-202403270000.db
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

在其中一个节点执行备份恢复命令

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$vim etcd_break.sh
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$sh etcd_break.sh
Error: data-dir "/var/lib/etcd" exists

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "rm -rf /var/lib/etcd" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'.  If you need to use command because file is insufficient you can add 'warn: false' to this command task or set
'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

备份恢复命令

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$cat etcd_break.sh
ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db \
 --name vms100.liruilongs.github.io  \
 --cert="/etc/kubernetes/pki/etcd/server.crt" \
 --key="/etc/kubernetes/pki/etcd/server.key"  \
 --cacert="/etc/kubernetes/pki/etcd/ca.crt"   \
 --endpoints="https://127.0.0.1:2379" \
 --initial-advertise-peer-urls="https://192.168.26.100:2380"  \
 --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" \
 --data-dir=/var/lib/etcd

再次执行，备份恢复成功

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$sh etcd_break.sh
2024-06-05 11:19:12.114058 I | mvcc: restore compact to 22239463
2024-06-05 11:19:12.137939 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:19:12.138023 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:19:12.138055 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7

其他的etcd节点备份恢复，需要修改脚本两个地方：

–name vms100.liruilongs.github.io
–initial-advertise-peer-urls=“https://192.168.26.100:2380”

192.168.26.101 节点执行

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.101 -m shell -a "ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db --name vms101.liruilongs.github.io  --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"   --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.101:2380"  --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd" -i host.yaml

192.168.26.101 | CHANGED | rc=0 >>
2024-06-05 11:25:25.557851 I | mvcc: restore compact to 22239463
2024-06-05 11:25:25.614487 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:25:25.614549 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:25:25.614574 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

192.168.26.102 节点执行

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.102 -m shell -a "ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db --name vms102.l
iruilongs.github.io  --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key"  --cacert=
"/etc/kubernetes/pki/etcd/ca.crt"   --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.102:2380"  --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd" -i host.yaml

192.168.26.102 | CHANGED | rc=0 >>
2024-06-05 11:30:06.918159 I | mvcc: restore compact to 22239463
2024-06-05 11:30:06.935413 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:30:06.935460 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:30:06.935471 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

移动静态 Pod对应的 yaml 文件，恢复 etcd 和apiserver 对应的Pod

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "mv /tmp/{etcd.yaml,kube-apiserver.yaml} /etc/kubernetes/manifests/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

确认静态pod 恢复

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "ls /etc/kubernetes/manifests/" -i host.yaml
192.168.26.100 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.102 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.101 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

查看 etcd 集群节点状态

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|  ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+

确认集群是否恢复

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$kubectl get nodes
NAME                          STATUS   ROLES           AGE    VERSION
vms100.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms101.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms102.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms103.liruilongs.github.io   Ready    <none>          495d   v1.25.1
vms105.liruilongs.github.io   Ready    <none>          495d   v1.25.1
vms106.liruilongs.github.io   Ready    <none>          495d   v1.25.1
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$