# etcd Backup, Restore, and Defragmentation

## Overview

### What is etcd

etcd is a distributed, consistent key-value store, used mainly for shared configuration and service discovery.

- Secure: automatic TLS, with optional client certificate authentication
- Fast: benchmarked at 10,000 writes/sec
- Reliable: uses Raft to guarantee consistency

### Advantages of etcd

- Simple. Written in Go, it is easy to deploy; its HTTP interface is easy to use; and the Raft algorithm's strong consistency is easy for users to reason about.
- Durable. By default, etcd persists every update as soon as it is made.
- Secure. etcd supports TLS authentication.

### Terminology

- Raft: the algorithm etcd uses to guarantee strong consistency in a distributed system.
- Node: an instance of the Raft state machine.
- Member: an etcd instance. It manages a Node and serves client requests.
- Cluster: a group of Members working together as an etcd cluster.
- Peer: what one Member calls another Member of the same cluster.
- Client: anything that sends HTTP requests to the etcd cluster.
- WAL: write-ahead log, the log format etcd uses for persistent storage.
- Snapshot: a snapshot of etcd's data state, taken to keep the WAL from growing without bound.
- Proxy: an etcd mode that provides reverse-proxy service for an etcd cluster.
- Leader: the node elected by the Raft algorithm to handle all data commits.
- Follower: a node that lost the election; it follows the Leader and helps the algorithm provide strong consistency.
- Candidate: a Follower becomes a Candidate and starts an election when it has not received the Leader's heartbeat for a certain time.
- Term: the period from one node becoming Leader until the next election.
- Index: a log entry number. Raft locates data by Term and Index.

### Architecture

![img](https://xieys.club/images/posts/clipboard.png)

A user request arrives at the HTTP Server and is forwarded to the Store for the actual transaction. If the request modifies node state, it is handed to the Raft module, which records the state change in the log and replicates it to the other etcd nodes; once the change is confirmed, the data is committed and synchronized again.

> HTTP Server: handles API requests from users as well as synchronization and heartbeat requests from other etcd nodes.

> Raft: the concrete implementation of the Raft strong-consistency algorithm; the core of etcd.

> WAL: Write Ahead Log, etcd's data storage mechanism — a family of techniques that provide atomicity and durability. Besides keeping the full data state and node index in memory, etcd persists everything through the WAL: every change is written to the log before it is committed.

- Entry: stores the actual log content.
- Snapshot: a state snapshot taken to keep the log from growing too large; it saves the Raft state as the log content changes.

> Store: implements the transactional side of etcd's features — data indexing, node state changes, watch and feedback, event handling and execution — that is, the concrete implementation of most of the APIs etcd offers its users.

## The Raft Algorithm

![img](https://xieys.club/images/posts/clipboard1.png)

The Raft algorithm involves three roles:

- follower
- candidate: an intermediate role that exists only during elections
- leader

### Elections

Two timeouts drive elections. The first is the election timeout — the time a node waits before converting from follower to candidate — a random value between 150 and 300 ms. The second is the heartbeat timeout.

- When a node's election timeout expires and it becomes a candidate, it starts a new election term and sends a vote request (Request Vote) to the other nodes. A node that has not yet voted in this term votes for the candidate and resets its own election timeout.
- A candidate that collects a majority of the votes becomes leader.
- The leader then sends Append Entries messages to the follower nodes at intervals bounded by the heartbeat timeout; each follower acknowledges them.
- The term continues until some follower stops receiving heartbeats and becomes a candidate.
- If two candidates receive the same number of votes in a term (a split vote), the nodes wait and a fresh election is held in a new term.

### Log replication
Once an election has produced a leader, the leader must replicate every change to the other nodes in the system, again by sending Append Entries messages.

- A client sends an update to the leader, which appends it to the leader's log.
- The leader forwards the update to the followers on the next heartbeat.
- Once a majority of followers have acknowledged the update, the leader commits it and replies to the client.

### Network partitions

- When a network partition occurs, nodes cut off from the leader stop receiving heartbeats and start an election, producing multiple partitions each with its own leader.
- If clients send updates to the different leaders, a partition holding fewer than half of the original cluster's nodes cannot commit them, while a partition holding more than half can.
- When the partition heals, committed updates are replicated to the other nodes; uncommitted log entries on the minority side are rolled back to match the new leader's log, keeping the data globally consistent.

## etcd Startup Configuration

### Core parameters

| Parameter | Description |
| -------------------------------- | ------------------------------------------------------------ |
| ETCD_DATA_DIR | Data storage directory |
| ETCD_NAME | This node's name within the etcd cluster; any value works as long as it is distinct |
| ETCD_LISTEN_PEER_URLS | Listen URLs for peer traffic (elections, data replication, etc.); multiple values allowed |
| ETCD_INITIAL_ADVERTISE_PEER_URLS | Advertised peer URLs; other members use these to reach this node |
| ETCD_LISTEN_CLIENT_URLS | Listen URLs for client traffic; multiple values allowed |
| ETCD_ADVERTISE_CLIENT_URLS | Advertised client URLs, used by etcd proxies or members to reach this node |
| ETCD_INITIAL_CLUSTER | The union of all members' initial-advertise-peer-urls |
| ETCD_INITIAL_CLUSTER_TOKEN | Cluster token; from it the cluster derives a unique cluster ID plus a unique ID per node |
| ETCD_INITIAL_CLUSTER_STATE | Cluster bootstrap flag: `new` for a fresh cluster, `existing` to join one that already exists |

## The etcdctl Tool

etcdctl is a command-line client that offers a set of concise commands — think of it as a command toolkit — convenient for testing the service or editing the database contents by hand. etcdctl works much like the other xxxctl tools (kubectl, systemctl, and so on).

Usage: etcdctl [global options] command [command options] [args...]

### v2

#### Database commands

etcd organizes keys in a hierarchical keyspace (similar to directories in a file system), and the database commands cover the full CRUD lifecycle (the REST-style Create, Read, Update, Delete) for keys and directories.

Run `etcdctl command --help` for the options of any specific command.

> Keys

```
set [create; works whether or not the key exists]: etcdctl set key value
mk [create; the key must not exist]: etcdctl mk key value
rm [delete]: etcdctl rm key
update [update]: etcdctl update key value
get [read]: etcdctl get key
```

> Directories

```
setdir [create; works whether or not the dir exists]: etcdctl setdir dir
mkdir [create; the dir must not exist]: etcdctl mkdir dir
rmdir [delete]: etcdctl rmdir dir
updatedir [update]: etcdctl updatedir dir
ls [read]: etcdctl ls
```

#### Non-database commands

> backup [back up etcd's data]

```
etcdctl backup
```

> watch [watch a key; as soon as the key is updated, print the new value and exit]

```
etcdctl watch key
```

> exec-watch [watch a key; run the given command whenever the key is updated]

```
etcdctl exec-watch key -- sh -c "ls"
```

> member [list, add, remove, or update etcd members of the cluster]

```
list:   etcdctl member list
add:    etcdctl member add <member>
remove: etcdctl member remove <member>
update: etcdctl member update <member>
```

> etcdctl cluster-health [check the cluster's health]

**Note: this command exists only in v2; it was removed in v3.**

### v3

To use the etcdctl v3 API, set the environment variable `ETCDCTL_API=3`. Apart from the differences below, usage is the same as v2.

#### Specifying the etcd API version and cluster

```
ETCDCTL_API=3
ENDPOINTS=10.240.0.17:2379,10.240.0.18:2379,10.240.0.19:2379
```

#### Database operations

```
1. Create/update
etcdctl --endpoints=$ENDPOINTS put foo "Hello World!"

2. Read
etcdctl --endpoints=$ENDPOINTS get foo
# JSON output
etcdctl --endpoints=$ENDPOINTS --write-out="json" get foo

Read by shared prefix
etcdctl --endpoints=$ENDPOINTS put web1 value1
etcdctl --endpoints=$ENDPOINTS put web2 value2
etcdctl --endpoints=$ENDPOINTS put web3 value3
etcdctl --endpoints=$ENDPOINTS get web --prefix

List all keys
etcdctl --endpoints=$ENDPOINTS get / --prefix --keys-only

3. Delete
etcdctl --endpoints=$ENDPOINTS put key myvalue
etcdctl --endpoints=$ENDPOINTS del key

etcdctl --endpoints=$ENDPOINTS put k1 value1
etcdctl --endpoints=$ENDPOINTS put k2 value2
etcdctl --endpoints=$ENDPOINTS del k --prefix
```

#### Cluster status

Cluster status boils down to two commands: `etcdctl endpoint status` and `etcdctl endpoint health`.

```
etcdctl --write-out=table --endpoints=$ENDPOINTS endpoint status

+------------------+------------------+---------+---------+-----------+-----------+------------+
|     ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------+------------------+---------+---------+-----------+-----------+------------+
| 10.240.0.17:2379 | 4917a7ab173fabe7 | 3.0.0   | 45 kB   | true      | 4         | 16726      |
| 10.240.0.18:2379 | 59796ba9cd1bcd72 | 3.0.0   | 45 kB   | false     | 4         | 16726      |
| 10.240.0.19:2379 | 94df724b66343e6c | 3.0.0   | 45 kB   | false     | 4         | 16726      |
+------------------+------------------+---------+---------+-----------+-----------+------------+

etcdctl --endpoints=$ENDPOINTS endpoint health

10.240.0.17:2379 is healthy: successfully committed proposal: took = 3.345431ms
10.240.0.19:2379 is healthy: successfully committed proposal: took = 3.767967ms
10.240.0.18:2379 is healthy: successfully committed proposal: took = 4.025451ms
```

#### Cluster members

The member-related commands are:

```
member add      Adds a member into the cluster
member remove   Removes a member from the cluster
member update   Updates a member in the cluster
member list     Lists all members in the cluster
```

For example, `etcdctl member list` lists the cluster members:

```
etcdctl --endpoints=http://172.16.5.4:12379 member list -w table
+-----------------+---------+-------+------------------------+-----------------------------------------------+
|       ID        | STATUS  | NAME  |       PEER ADDRS       |                 CLIENT ADDRS                  |
+-----------------+---------+-------+------------------------+-----------------------------------------------+
| c856d92a82ba66a | started | etcd0 | http://172.16.5.4:2380 | http://172.16.5.4:2379,http://172.16.5.4:4001 |
+-----------------+---------+-------+------------------------+-----------------------------------------------+
```

## Backup and Restore

### Backup

```
# mkdir -p /tmp/backup/etcd/    # directory for the backup data
# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --endpoints="https://192.168.1.92:2379" snapshot save /tmp/backup/etcd/etcd-snapshot-`date +%Y%m%d`.db
```

### Backup script

```
#!/usr/bin/env bash

CACERT="/etc/kubernetes/pki/etcd/ca.crt"
CERT="/etc/kubernetes/pki/etcd/server.crt"
KEY="/etc/kubernetes/pki/etcd/server.key"
ENDPOINTS="https://192.168.1.92:2379"

ETCDCTL_API=3 etcdctl \
  --cacert="${CACERT}" --cert="${CERT}" --key="${KEY}" \
  --endpoints=${ENDPOINTS} \
  snapshot save /tmp/backup/etcd/etcd-snapshot-`date +%Y%m%d`.db

# keep backups for 30 days
find /tmp/backup/etcd -name "*.db" -mtime +30 -exec rm -f {} \;
```

### Restore

#### Single-node restore

The original cluster is a single node, and we want to restore from a backup file. The single test-environment etcd node was backed up above; now restore it onto another etcd node.

```
Install
[root@etcd1 ~]# wget https://mirrors.huaweicloud.com/etcd/v3.4.10/etcd-v3.4.10-linux-amd64.tar.gz
[root@etcd1 ~]# tar -xf etcd-v3.4.10-linux-amd64.tar.gz
[root@etcd1 ~]# cp /root/etcd-v3.4.10-linux-amd64/{etcd,etcdctl} /usr/bin/
[root@etcd1 ~]# systemctl disable --now firewalld

[root@etcd1 ~]# vim /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target

[Service]
Type=simple
WorkingDirectory=/data/etcd
EnvironmentFile=-/etc/etcd/etcd.conf
ExecStart=/usr/bin/etcd

[Install]
WantedBy=multi-user.target

[root@etcd1 ~]# mkdir -p /data/etcd
[root@etcd1 ~]# mkdir -p /etc/etcd/
[root@etcd1 ~]# vim /etc/etcd/etcd.conf
ETCD_NAME=default
ETCD_DATA_DIR="/data/etcd/default.etcd/"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://0.0.0.0:2379"

[root@etcd1 ~]# systemctl enable --now etcd

1. Stop the etcd service
[root@etcd1 ~]# systemctl stop etcd

2. Copy the backup file to the current directory

3. Back up the previous data files
[root@etcd1 ~]# mv /data/etcd/default.etcd{,.bak}

4. Restore
[root@etcd1 ~]# ETCDCTL_API=3 etcdctl snapshot restore etcd-snapshot-20210824.db --data-dir=/data/etcd/default.etcd
{"level":"info","ts":1629804137.2476945,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"etcd-snapshot-20210824.db","wal-dir":"/data/etcd/default.etcd/member/wal","data-dir":"/data/etcd/default.etcd","snap-dir":"/data/etcd/default.etcd/member/snap"}
{"level":"info","ts":1629804141.1843638,"caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":193110114}
{"level":"info","ts":1629804142.0297222,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1629804142.0416465,"caller":"snapshot/v3_snapshot.go:309","msg":"restored snapshot","path":"etcd-snapshot-20210824.db","wal-dir":"/data/etcd/default.etcd/member/wal","data-dir":"/data/etcd/default.etcd","snap-dir":"/data/etcd/default.etcd/member/snap"}

5. Start
[root@etcd1 ~]# systemctl start etcd

6. Verify
[root@etcd1 ~]# ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 get /registry/configmaps/kube-system/kube-flannel-cfg
/registry/configmaps/kube-system/kube-flannel-cfg
k8s v1 ConfigMap kube-flannel-cfg kube-system ... (raw protobuf-framed value; non-printable bytes omitted, the embedded payloads are readable)
cni-conf.json: { "name": "cbr0", "cniVersion": "0.3.1", "plugins": [ { "type": "flannel", "delegate": { "hairpinMode": true, "isDefaultGateway": true } }, { "type": "portmap", "capabilities": { "portMappings": true } } ] }
net-conf.json: { "Network": "10.244.0.0/16", "Backend": { "Type": "vxlan" } }
```

#### Migrating single-node data to a cluster

New requirement: a single-node etcd is not reliable, so the data of the single node above must be migrated into a cluster.

| Node   | IP             |
| ------ | -------------- |
| etcd-1 | 192.168.116.15 |
| etcd-2 | 192.168.116.16 |
| etcd-3 | 192.168.116.17 |

> Creating certificates

```
Use cfssl to generate self-signed certificates. First download the cfssl tools:
[root@etcd-1 ~]# wget https://pkg.cfssl.org/R1.2/cfssl_linux-amd64
[root@etcd-1 ~]# wget https://pkg.cfssl.org/R1.2/cfssljson_linux-amd64
[root@etcd-1 ~]# wget https://pkg.cfssl.org/R1.2/cfssl-certinfo_linux-amd64
[root@etcd-1 ~]# chmod +x cfssl_linux-amd64 cfssljson_linux-amd64 cfssl-certinfo_linux-amd64
[root@etcd-1 ~]# mv cfssl_linux-amd64 /usr/local/bin/cfssl
[root@etcd-1 ~]# mv cfssljson_linux-amd64 /usr/local/bin/cfssljson
[root@etcd-1 ~]# mv cfssl-certinfo_linux-amd64 /usr/local/bin/cfssl-certinfo

# Create the CA (Certificate Authority)
# Export the default config template
[root@etcd-1 ~]# mkdir ssl
[root@etcd-1 ~]# cd ssl/
[root@etcd-1 ssl]# cfssl print-defaults config > config.json
# Export the default certificate signing request (CSR) template
[root@etcd-1 ssl]# cfssl print-defaults csr > csr.json

# Fill in config.json, based on the exported template
[root@etcd-1 ssl]# cat config.json
{
    "signing": {
        "default": {
            "expiry": "438000h"
        },
        "profiles": {
            "server": {
                "expiry": "438000h",
                "usages": [
                    "signing",
                    "key encipherment",
                    "server auth",
                    "client auth"
                ]
            },
            "client": {
                "expiry": "438000h",
                "usages": [
                    "signing",
                    "key encipherment",
                    "client auth"
                ]
            },
            "peer": {
                "expiry": "438000h",
                "usages": [
                    "signing",
                    "key encipherment",
                    "server auth",
                    "client auth"
                ]
            }
        }
    }
}

# Create ca-csr.json, based on the csr.json template
[root@etcd-1 ssl]# cat ca-csr.json
{
    "CN": "etcd",
    "key": {
        "algo": "ecdsa",
        "size": 256
    }
}

# Generate the CA certificate and private key
[root@etcd-1 ssl]# cfssl gencert -initca ca-csr.json | cfssljson -bare ca
This produces ca-key.pem (private key) and ca.pem (certificate), the files needed to run the CA, plus ca.csr (certificate signing request) for cross-signing or re-signing.

# Create the client CSR file
[root@etcd-1 ssl]# cat client.json
{
    "CN": "client",
    "key": {
        "algo": "ecdsa",
        "size": 256
    }
}
[root@etcd-1 ssl]# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=config.json -profile=client client.json | cfssljson -bare client

# Create the etcd server and peer CSR file
[root@etcd-1 ssl]# cat etcd.json
{
    "CN": "etcd",
    "hosts": [
        "192.168.116.15",
        "192.168.116.16",
        "192.168.116.17",
        "192.168.116.18",
        "192.168.116.19",
        "192.168.116.20",
        "192.168.116.21",
        "192.168.116.22",
        "192.168.116.23",
        "192.168.116.24",
        "192.168.116.25"
    ],
    "key": {
        "algo": "ecdsa",
        "size": 256
    },
    "names": [
        {
            "C": "CN",
            "L": "GG",
            "ST": "SZ"
        }
    ]
}
The extra hosts are reserved nodes, kept for future scale-out.

# Generate the server certificate
[root@etcd-1 ssl]# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=config.json -profile=server etcd.json | cfssljson -bare server

# Generate the peer certificate
[root@etcd-1 ssl]# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=config.json -profile=peer etcd.json | cfssljson -bare peer
```

> Cluster installation

```
[root@etcd-1 ~]# wget https://mirrors.huaweicloud.com/etcd/v3.4.10/etcd-v3.4.10-linux-amd64.tar.gz
[root@etcd-1 ~]# scp etcd-v3.4.10-linux-amd64.tar.gz 192.168.116.16:
[root@etcd-1 ~]# scp etcd-v3.4.10-linux-amd64.tar.gz 192.168.116.17:

# Copy the certificate directory across
[root@etcd-1 ~]# scp -r ssl/ 192.168.116.16:
[root@etcd-1 ~]# scp -r ssl/ 192.168.116.17:

Unpack the tarball; same steps on every machine
[root@etcd-1 ~]# mkdir /opt/etcd/{bin,cfg,ssl,data} -p
[root@etcd-1 ~]# tar -xf etcd-v3.4.10-linux-amd64.tar.gz -C /opt/
[root@etcd-1 ~]# mv /opt/etcd-v3.4.10-linux-amd64/{etcd,etcdctl} /opt/etcd/bin/
[root@etcd-1 ~]# cp /root/ssl/*.pem /opt/etcd/ssl/
[root@etcd-1 ~]# systemctl disable --now firewalld

On etcd-1:
# Create etcd's environment file
[root@etcd-1 ~]# cat /opt/etcd/cfg/etcd.conf
NAME="etcd-1"
DATA_DIR="/opt/etcd/data"
LISTEN_PEER_URLS="https://192.168.116.15:2380"
LISTEN_CLIENT_URLS="https://192.168.116.15:2379"
INITIAL_ADVERTISE_PEER_URLS="https://192.168.116.15:2380"
ADVERTISE_CLIENT_URLS="https://192.168.116.15:2379"
INITIAL_CLUSTER="etcd-1=https://192.168.116.15:2380,etcd-2=https://192.168.116.16:2380"
INITIAL_CLUSTER_TOKEN="etcd-cluster"
INITIAL_CLUSTER_STATE="new"

# Manage etcd with systemd
[root@etcd-1 ~]# cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=/opt/etcd/cfg/etcd.conf
ExecStart=/opt/etcd/bin/etcd \
  --name=${NAME} \
  --data-dir=${DATA_DIR} \
  --listen-peer-urls=${LISTEN_PEER_URLS} \
  --listen-client-urls=${LISTEN_CLIENT_URLS},https://127.0.0.1:2379 \
  --advertise-client-urls=${ADVERTISE_CLIENT_URLS} \
  --initial-advertise-peer-urls=${INITIAL_ADVERTISE_PEER_URLS} \
  --initial-cluster=${INITIAL_CLUSTER} \
  --initial-cluster-token=${INITIAL_CLUSTER_TOKEN} \
  --initial-cluster-state=${INITIAL_CLUSTER_STATE} \
  --cert-file=/opt/etcd/ssl/server.pem \
  --key-file=/opt/etcd/ssl/server-key.pem \
  --peer-cert-file=/opt/etcd/ssl/peer.pem \
  --peer-key-file=/opt/etcd/ssl/peer-key.pem \
  --trusted-ca-file=/opt/etcd/ssl/ca.pem \
  --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

# The data directory must have mode 700
[root@etcd-1 ~]# chmod 700 /opt/etcd/data/

On etcd-2:
[root@etcd-2 ~]# cat /opt/etcd/cfg/etcd.conf
NAME="etcd-2"
DATA_DIR="/opt/etcd/data"
LISTEN_PEER_URLS="https://192.168.116.16:2380"
LISTEN_CLIENT_URLS="https://192.168.116.16:2379"
INITIAL_ADVERTISE_PEER_URLS="https://192.168.116.16:2380"
ADVERTISE_CLIENT_URLS="https://192.168.116.16:2379"
INITIAL_CLUSTER="etcd-1=https://192.168.116.15:2380,etcd-2=https://192.168.116.16:2380"
INITIAL_CLUSTER_TOKEN="etcd-cluster"
INITIAL_CLUSTER_STATE="new"

[root@etcd-2 ~]# cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=/opt/etcd/cfg/etcd.conf
ExecStart=/opt/etcd/bin/etcd \
  --name=${NAME} \
  --data-dir=${DATA_DIR} \
  --listen-peer-urls=${LISTEN_PEER_URLS} \
  --listen-client-urls=${LISTEN_CLIENT_URLS},https://127.0.0.1:2379 \
  --advertise-client-urls=${ADVERTISE_CLIENT_URLS} \
  --initial-advertise-peer-urls=${INITIAL_ADVERTISE_PEER_URLS} \
  --initial-cluster=${INITIAL_CLUSTER} \
  --initial-cluster-token=${INITIAL_CLUSTER_TOKEN} \
  --initial-cluster-state=${INITIAL_CLUSTER_STATE} \
  --cert-file=/opt/etcd/ssl/server.pem \
  --key-file=/opt/etcd/ssl/server-key.pem \
  --peer-cert-file=/opt/etcd/ssl/peer.pem \
  --peer-key-file=/opt/etcd/ssl/peer-key.pem \
  --trusted-ca-file=/opt/etcd/ssl/ca.pem \
  --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

[root@etcd-2 ~]# chmod 700 /opt/etcd/data/

Start etcd-1 and etcd-2 (run on both nodes)
[root@etcd-2 ~]# systemctl enable --now etcd

Check the cluster status
[root@etcd-1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379" -w table endpoint --cluster status
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.116.16:2379 | 38e677eb51b6e690 | 3.4.10  | 25 kB   | false     | false      | 78        | 7          | 7                  |        |
| https://192.168.116.15:2379 | b5779181c59c3700 | 3.4.10  | 25 kB   | true      | false      | 78        | 7          | 7                  |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Or:
[root@etcd-1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379" -w table endpoint status

Note: plain `endpoint status` lists only the nodes given in --endpoints, while `endpoint --cluster status` lists every node in the cluster
```

> Adding etcd-3 to the cluster

```
Add node etcd-3 to the cluster
[root@etcd-1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379" member add etcd-3 --peer-urls=https://192.168.116.17:2380
Member 4238c12f7fcf2ceb added to cluster 908202c1add782e4

ETCD_NAME="etcd-3"
ETCD_INITIAL_CLUSTER="etcd-2=https://192.168.116.16:2380,etcd-3=https://192.168.116.17:2380,etcd-1=https://192.168.116.15:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.116.17:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
# Note: the new node's etcd config file must include the values printed above

List the cluster members
[root@etcd-1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379" -w table member list
+------------------+-----------+--------+-----------------------------+-----------------------------+------------+
|        ID        |  STATUS   |  NAME  |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+-----------+--------+-----------------------------+-----------------------------+------------+
| 4da14f0c9981c64  | unstarted |        | https://192.168.116.17:2380 |                             | false      |
| 38e677eb51b6e690 | started   | etcd-2 | https://192.168.116.16:2380 | https://192.168.116.16:2379 | false      |
| b5779181c59c3700 | started   | etcd-1 | https://192.168.116.15:2380 | https://192.168.116.15:2379 | false      |
+------------------+-----------+--------+-----------------------------+-----------------------------+------------+

Install and configure etcd on etcd-3
[root@etcd-3 ~]# cat /opt/etcd/cfg/etcd.conf
DATA_DIR="/opt/etcd/data"
LISTEN_PEER_URLS="https://192.168.116.17:2380"
LISTEN_CLIENT_URLS="https://192.168.116.17:2379"
ADVERTISE_CLIENT_URLS="https://192.168.116.17:2379"
INITIAL_CLUSTER_TOKEN="etcd-cluster"
NAME="etcd-3"
INITIAL_CLUSTER="etcd-2=https://192.168.116.16:2380,etcd-3=https://192.168.116.17:2380,etcd-1=https://192.168.116.15:2380"
INITIAL_ADVERTISE_PEER_URLS="https://192.168.116.17:2380"
INITIAL_CLUSTER_STATE="existing"

[root@etcd-3 ~]# cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=/opt/etcd/cfg/etcd.conf
ExecStart=/opt/etcd/bin/etcd \
  --name=${NAME} \
  --data-dir=${DATA_DIR} \
  --listen-peer-urls=${LISTEN_PEER_URLS} \
  --listen-client-urls=${LISTEN_CLIENT_URLS},https://127.0.0.1:2379 \
  --advertise-client-urls=${ADVERTISE_CLIENT_URLS} \
  --initial-advertise-peer-urls=${INITIAL_ADVERTISE_PEER_URLS} \
  --initial-cluster=${INITIAL_CLUSTER} \
  --initial-cluster-token=${INITIAL_CLUSTER_TOKEN} \
  --initial-cluster-state=${INITIAL_CLUSTER_STATE} \
  --cert-file=/opt/etcd/ssl/server.pem \
  --key-file=/opt/etcd/ssl/server-key.pem \
  --peer-cert-file=/opt/etcd/ssl/peer.pem \
  --peer-key-file=/opt/etcd/ssl/peer-key.pem \
  --trusted-ca-file=/opt/etcd/ssl/ca.pem \
  --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

[root@etcd-3 ~]# chmod 700 /opt/etcd/data/

Start etcd on etcd-3
[root@etcd-3 ~]# systemctl enable --now etcd

# Listing the members and the cluster status again shows that etcd-3 has successfully joined the cluster
[root@etcd-1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379" -w table member list
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  |  NAME  |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
| 1d1a009f14d49f5d | started | etcd-3 | https://192.168.116.17:2380 | https://192.168.116.17:2379 | false      |
| 38e677eb51b6e690 | started | etcd-2 | https://192.168.116.16:2380 | https://192.168.116.16:2379 | false      |
| b5779181c59c3700 | started | etcd-1 | https://192.168.116.15:2380 | https://192.168.116.15:2379 | false      |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+

[root@etcd-1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379" -w table endpoint --cluster status
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.116.17:2379 | 1d1a009f14d49f5d | 3.4.10  | 25 kB   | false     | false      | 3         | 8          | 8                  |        |
| https://192.168.116.16:2379 | 38e677eb51b6e690 | 3.4.10  | 25 kB   | false     | false      | 3         | 8          | 8                  |        |
| https://192.168.116.15:2379 | b5779181c59c3700 | 3.4.10  | 20 kB   | true      | false      | 3         | 8          | 8                  |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```

> Data migration

```
Run the following on any one of the cluster nodes to mirror the data from the single node into the cluster
[root@etcd-1 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl make-mirror --endpoints="http://192.168.116.10:2379" https://192.168.116.15:2379 --dest-cacert=/opt/etcd/ssl/ca.pem --dest-cert=/opt/etcd/ssl/client.pem --dest-key=/opt/etcd/ssl/client-key.pem

Verify after the migration
For example, list all keys; there are too many, so show only the first 20 lines
[root@etcd-2 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" get / --prefix --keys-only | head -20
/registry/apiregistration.k8s.io/apiservices/v1.
/registry/apiregistration.k8s.io/apiservices/v1.apps
/registry/apiregistration.k8s.io/apiservices/v1.authentication.k8s.io
/registry/apiregistration.k8s.io/apiservices/v1.authorization.k8s.io
/registry/apiregistration.k8s.io/apiservices/v1.autoscaling
/registry/apiregistration.k8s.io/apiservices/v1.batch
/registry/apiregistration.k8s.io/apiservices/v1.coordination.k8s.io
/registry/apiregistration.k8s.io/apiservices/v1.networking.k8s.io
/registry/apiregistration.k8s.io/apiservices/v1.rbac.authorization.k8s.io
/registry/apiregistration.k8s.io/apiservices/v1.scheduling.k8s.io

Note: make-mirror does not stop on its own; every 30 s it prints the number of keys synchronized so far. When that number stops changing for a while, the sync is complete.
```

#### Cluster backup and restore

> Backup

```
Backing up from any one node in the cluster is enough
[root@etcd-1 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" snapshot save etcd-snapshot-`date +%Y%m%d`.db
```

> Restore

```
Stop the etcd service on every node in the cluster
[root@etcd-1 ~]# systemctl stop etcd

Back up the previous data directory on every node
[root@etcd-1 ~]# mv /opt/etcd/data{,.bak}

Prepare the restore script
[root@etcd-1 ~]# cat restore.sh
#!/bin/bash
count=1
for ip in 192.168.116.{15..17}
do
    ETCDCTL_API=3 /opt/etcd/bin/etcdctl snapshot restore /root/etcd-snapshot-20210824.db \
      --name etcd-${count} \
      --initial-cluster "etcd-1=https://192.168.116.15:2380,etcd-2=https://192.168.116.16:2380,etcd-3=https://192.168.116.17:2380" \
      --initial-cluster-token etcd-cluster \
      --initial-advertise-peer-urls https://${ip}:2380 \
      --data-dir=/root/node${count}
    # The trailing / on /root/node${count}/ matters: without it rsync copies the
    # node${count} directory itself into /opt/etcd/data; with it, only the member
    # directory underneath is synced into /opt/etcd/data
    rsync -av --delete /root/node${count}/ $ip:/opt/etcd/data
    let count++
done

Run the script to restore
[root@etcd-1 ~]# sh restore.sh

Set all data directories back to mode 700
[root@etcd-1 ~]# chmod 700 /opt/etcd/data

Start etcd node by node
Note: the config files need updating first — the cluster originally had only two members, so etcd-3 must now be added to every config file

On etcd-1
[root@etcd-1 ~]# cat /opt/etcd/cfg/etcd.conf
NAME="etcd-1"
DATA_DIR="/opt/etcd/data"
LISTEN_PEER_URLS="https://192.168.116.15:2380"
LISTEN_CLIENT_URLS="https://192.168.116.15:2379"
INITIAL_ADVERTISE_PEER_URLS="https://192.168.116.15:2380"
ADVERTISE_CLIENT_URLS="https://192.168.116.15:2379"
INITIAL_CLUSTER="etcd-1=https://192.168.116.15:2380,etcd-2=https://192.168.116.16:2380,etcd-3=https://192.168.116.17:2380"
INITIAL_CLUSTER_TOKEN="etcd-cluster"
INITIAL_CLUSTER_STATE="new"

On etcd-2
[root@etcd-2 ~]# cat /opt/etcd/cfg/etcd.conf
NAME="etcd-2"
DATA_DIR="/opt/etcd/data"
LISTEN_PEER_URLS="https://192.168.116.16:2380"
LISTEN_CLIENT_URLS="https://192.168.116.16:2379"
INITIAL_ADVERTISE_PEER_URLS="https://192.168.116.16:2380"
ADVERTISE_CLIENT_URLS="https://192.168.116.16:2379"
INITIAL_CLUSTER="etcd-1=https://192.168.116.15:2380,etcd-2=https://192.168.116.16:2380,etcd-3=https://192.168.116.17:2380"
INITIAL_CLUSTER_TOKEN="etcd-cluster"
INITIAL_CLUSTER_STATE="new"

On etcd-3
[root@etcd-3 ~]# cat /opt/etcd/cfg/etcd.conf
DATA_DIR="/opt/etcd/data"
LISTEN_PEER_URLS="https://192.168.116.17:2380"
LISTEN_CLIENT_URLS="https://192.168.116.17:2379"
ADVERTISE_CLIENT_URLS="https://192.168.116.17:2379"
INITIAL_CLUSTER_TOKEN="etcd-cluster"
NAME="etcd-3"
INITIAL_CLUSTER="etcd-2=https://192.168.116.16:2380,etcd-3=https://192.168.116.17:2380,etcd-1=https://192.168.116.15:2380"
INITIAL_ADVERTISE_PEER_URLS="https://192.168.116.17:2380"
INITIAL_CLUSTER_STATE="new"

Start them one by one
[root@etcd-1 ~]# systemctl start etcd

# Verify
[root@etcd-1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" -w table endpoint status
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.116.15:2379 | b5779181c59c3700 | 3.4.10  | 25 kB   | false     | false      | 41        | 9          | 9                  |        |
| https://192.168.116.16:2379 | 38e677eb51b6e690 | 3.4.10  | 20 kB   | false     | false      | 41        | 9          | 9                  |        |
| https://192.168.116.17:2379 | 630a75ff7591bbf7 | 3.4.10  | 25 kB   | true      | false      | 41        | 9          | 9                  |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Verify that the data came back
[root@etcd-2 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" get / --prefix --keys-only | egrep -v '^$' | wc -l
24172

The data has been restored successfully.
```

#### Kubernetes etcd restore order

Stop kube-apiserver --> stop etcd --> restore the data --> start etcd --> start kube-apiserver

## Compacting History

If the keyspace is never compacted, its history keeps growing; once the backend hits its size threshold, the cluster enters a state where only reads and deletes are possible and writes are rejected. Periodically compacting the cluster's history is therefore essential.
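Manual compaction takes a revision number as its argument, and one way to pick it is to parse the current revision out of `etcdctl endpoint status -w json`. The sketch below hard-codes a sample JSON payload (hypothetical values) in place of live command output, so it runs without a cluster:

```shell
# Extract the current revision from `endpoint status` JSON output.
# status_json is a hard-coded sample (hypothetical values); on a live
# cluster, substitute the real output of:
#   ETCDCTL_API=3 etcdctl --endpoints=$ENDPOINTS endpoint status -w json
status_json='[{"Endpoint":"https://192.168.116.15:2379","Status":{"header":{"revision":193110120}}}]'
rev=$(printf '%s' "$status_json" | sed -n 's/.*"revision":\([0-9]*\).*/\1/p')
echo "current revision: $rev"   # prints: current revision: 193110120
```

That revision can then be fed to `etcdctl compaction` (with the usual certificate and endpoint flags) to drop all history older than it.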
Compaction does not delete current data; it only cleans up historical versions of keys. After compaction, those historical revisions can no longer be read, but access to the latest data is unaffected.

### Manual compaction

```
Compact away the history before revision 10
[root@etcd-2 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" compaction 10

The current revision can be read from any key's JSON output
[root@etcd-2 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" get /registry/configmaps/kube-system/kube-flannel-cfg -w json

# Reading data from before revision 10 now reports that it no longer exists
[root@etcd-2 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" get /registry/configmaps/kube-system/kube-flannel-cfg --rev=9
{"level":"warn","ts":"2021-08-24T23:02:30.151+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-26a4a8df-de68-459d-9999-2ef8a949b43e/192.168.116.15:2379","attempt":0,"error":"rpc error: code = OutOfRange desc = etcdserver: mvcc: required revision has been compacted"}
Error: etcdserver: mvcc: required revision has been compacted
```

### Automatic compaction

Starting etcd with `--auto-compaction-retention=1` makes it compact automatically, keeping one hour of history.

### Defragmentation

After a compaction, the compacted old revisions leave internal fragmentation behind: disk space that is free and available to the backend, but still occupied on disk. Defragmentation releases that space back to the file system.

```
[root@etcd-2 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" defrag
```
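Defragmentation runs per member, and each member blocks while it rewrites its backend database, so on a live cluster it is gentler to walk the endpoints one at a time than to pass them all at once. A minimal sketch reusing the endpoints and certificate paths from the cluster above — it only builds the per-member command lines (so it runs anywhere); pipe its output to `sh` on a machine that can actually reach the cluster:

```shell
#!/usr/bin/env bash
# Build one defrag command per member; running them serially keeps the
# rest of the cluster serving while each member rewrites its backend.
certs="--cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem"
cmds=""
for ep in https://192.168.116.15:2379 https://192.168.116.16:2379 https://192.168.116.17:2379; do
  cmds="${cmds}ETCDCTL_API=3 /opt/etcd/bin/etcdctl $certs --endpoints=$ep defrag
"
done
printf '%s' "$cmds"
```

Running the emitted commands one after another defragments the whole cluster without taking more than one member offline at a time.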