# etcd Backup, Restore, and Defragmentation

## Overview

### What is etcd

etcd is a distributed, consistent key-value store, used mainly for shared configuration and service discovery.

- Secure: automatic TLS, with optional client certificate authentication
- Fast: benchmarked at 10,000 writes/sec
- Reliable: uses Raft to guarantee consistency

### Advantages of etcd

- Simple. Written in Go, it is easy to deploy; its HTTP interface is easy to use; and the Raft algorithm's strong consistency is easy for users to reason about.
- Durable. By default, etcd persists every update as soon as it is made.
- Secure. etcd supports TLS authentication.

### Terminology

- Raft: the algorithm etcd uses to guarantee strong consistency in a distributed system.
- Node: an instance of the Raft state machine.
- Member: an etcd instance. It manages a Node and serves client requests.
- Cluster: a group of Members working together as an etcd cluster.
- Peer: what one Member calls another Member of the same cluster.
- Client: anything that sends HTTP requests to the etcd cluster.
- WAL: write-ahead log, the log format etcd uses for persistent storage.
- Snapshot: a snapshot of etcd's data state, taken to keep the WAL from growing without bound.
- Proxy: an etcd mode that provides reverse-proxy service for an etcd cluster.
- Leader: the node elected by the Raft algorithm to handle all data commits.
- Follower: a node that lost the election; it follows the Leader and helps the algorithm provide strong consistency.
- Candidate: a Follower becomes a Candidate and starts an election when it has not received the Leader's heartbeat for a certain time.
- Term: the period from one node becoming Leader until the next election.
- Index: a log entry number. Raft locates data by Term and Index.

### Architecture

![img](https://xieys.club/images/posts/clipboard.png)

A user request arrives at the HTTP Server and is forwarded to the Store for the actual transaction. If the request modifies node state, it is handed to the Raft module, which records the state change in the log and replicates it to the other etcd nodes; once the change is confirmed, the data is committed and synchronized again.

> HTTP Server: handles API requests from users as well as synchronization and heartbeat requests from other etcd nodes.

> Raft: the concrete implementation of the Raft strong-consistency algorithm; the core of etcd.

> WAL: Write Ahead Log, etcd's data storage mechanism — a family of techniques that provide atomicity and durability. Besides keeping the full data state and node index in memory, etcd persists everything through the WAL: every change is written to the log before it is committed.

- Entry: stores the actual log content.
- Snapshot: a state snapshot taken to keep the log from growing too large; it saves the Raft state as the log content changes.

> Store: implements the transactional side of etcd's features — data indexing, node state changes, watch and feedback, event handling and execution — that is, the concrete implementation of most of the APIs etcd offers its users.

## The Raft Algorithm

![img](https://xieys.club/images/posts/clipboard1.png)

The Raft algorithm involves three roles:

- follower
- candidate: an intermediate role that exists only during elections
- leader

### Elections

Two timeouts drive elections. The first is the election timeout — the time a node waits before converting from follower to candidate — a random value between 150 and 300 ms. The second is the heartbeat timeout.

- When a node's election timeout expires and it becomes a candidate, it starts a new election term and sends a vote request (Request Vote) to the other nodes. A node that has not yet voted in this term votes for the candidate and resets its own election timeout.
- A candidate that collects a majority of the votes becomes leader.
- The leader then sends Append Entries messages to the follower nodes at intervals bounded by the heartbeat timeout; each follower acknowledges them.
- The term continues until some follower stops receiving heartbeats and becomes a candidate.
- If two candidates receive the same number of votes in a term (a split vote), the nodes wait and a fresh election is held in a new term.

### Log replication
Once an election has produced a leader, the leader must replicate every change to the other nodes in the system, again by sending Append Entries messages.

- A client sends an update to the leader, which appends it to the leader's log.
- The leader forwards the update to the followers on the next heartbeat.
- Once a majority of followers have acknowledged the update, the leader commits it and replies to the client.

### Network partitions

- When a network partition occurs, nodes cut off from the leader stop receiving heartbeats and start an election, producing multiple partitions each with its own leader.
- If clients send updates to the different leaders, a partition holding fewer than half of the original cluster's nodes cannot commit them, while a partition holding more than half can.
- When the partition heals, committed updates are replicated to the other nodes; uncommitted log entries on the minority side are rolled back to match the new leader's log, keeping the data globally consistent.

## etcd Startup Configuration

### Core parameters

| Parameter | Description |
| -------------------------------- | ------------------------------------------------------------ |
| ETCD_DATA_DIR | Data storage directory |
| ETCD_NAME | This node's name within the etcd cluster; any value works as long as it is distinct |
| ETCD_LISTEN_PEER_URLS | Listen URLs for peer traffic (elections, data replication, etc.); multiple values allowed |
| ETCD_INITIAL_ADVERTISE_PEER_URLS | Advertised peer URLs; other members use these to reach this node |
| ETCD_LISTEN_CLIENT_URLS | Listen URLs for client traffic; multiple values allowed |
| ETCD_ADVERTISE_CLIENT_URLS | Advertised client URLs, used by etcd proxies or members to reach this node |
| ETCD_INITIAL_CLUSTER | The union of all members' initial-advertise-peer-urls |
| ETCD_INITIAL_CLUSTER_TOKEN | Cluster token; from it the cluster derives a unique cluster ID plus a unique ID per node |
| ETCD_INITIAL_CLUSTER_STATE | Cluster bootstrap flag: `new` for a fresh cluster, `existing` to join one that already exists |

## The etcdctl Tool

etcdctl is a command-line client that offers a set of concise commands — think of it as a command toolkit — convenient for testing the service or editing the database contents by hand. etcdctl works much like the other xxxctl tools (kubectl, systemctl, and so on).

Usage: etcdctl [global options] command [command options] [args...]

### v2

#### Database commands

etcd organizes keys in a hierarchical keyspace (similar to directories in a file system), and the database commands cover the full CRUD lifecycle (the REST-style Create, Read, Update, Delete) for keys and directories.

Run `etcdctl command --help` for the options of any specific command.

> Keys

```
set [create; works whether or not the key exists]: etcdctl set key value
mk [create; the key must not exist]: etcdctl mk key value
rm [delete]: etcdctl rm key
update [update]: etcdctl update key value
get [read]: etcdctl get key
```

> Directories

```
setdir [create; works whether or not the dir exists]: etcdctl setdir dir
mkdir [create; the dir must not exist]: etcdctl mkdir dir
rmdir [delete]: etcdctl rmdir dir
updatedir [update]: etcdctl updatedir dir
ls [read]: etcdctl ls
```

#### Non-database commands

> backup [back up etcd's data]

```
etcdctl backup
```

> watch [watch a key; as soon as the key is updated, print the new value and exit]

```
etcdctl watch key
```

> exec-watch [watch a key; run the given command whenever the key is updated]

```
etcdctl exec-watch key -- sh -c "ls"
```

> member [list, add, remove, or update etcd members of the cluster]

```
list:   etcdctl member list
add:    etcdctl member add <member>
remove: etcdctl member remove <member>
update: etcdctl member update <member>
```

> etcdctl cluster-health [check the cluster's health]

**Note: this command exists only in v2; it was removed in v3.**

### v3

To use the etcdctl v3 API, set the environment variable `ETCDCTL_API=3`. Apart from the differences below, usage is the same as v2.

#### Specifying the etcd API version and cluster

```
ETCDCTL_API=3
ENDPOINTS=10.240.0.17:2379,10.240.0.18:2379,10.240.0.19:2379
```

#### Database operations

```
1. Create/update
etcdctl --endpoints=$ENDPOINTS put foo "Hello World!"

2. Read
etcdctl --endpoints=$ENDPOINTS get foo
# JSON output
etcdctl --endpoints=$ENDPOINTS --write-out="json" get foo

Read by shared prefix
etcdctl --endpoints=$ENDPOINTS put web1 value1
etcdctl --endpoints=$ENDPOINTS put web2 value2
etcdctl --endpoints=$ENDPOINTS put web3 value3
etcdctl --endpoints=$ENDPOINTS get web --prefix

List all keys
etcdctl --endpoints=$ENDPOINTS get / --prefix --keys-only

3. Delete
etcdctl --endpoints=$ENDPOINTS put key myvalue
etcdctl --endpoints=$ENDPOINTS del key

etcdctl --endpoints=$ENDPOINTS put k1 value1
etcdctl --endpoints=$ENDPOINTS put k2 value2
etcdctl --endpoints=$ENDPOINTS del k --prefix
```

#### Cluster status

Cluster status boils down to two commands: `etcdctl endpoint status` and `etcdctl endpoint health`.

```
etcdctl --write-out=table --endpoints=$ENDPOINTS endpoint status

+------------------+------------------+---------+---------+-----------+-----------+------------+
|     ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------+------------------+---------+---------+-----------+-----------+------------+
| 10.240.0.17:2379 | 4917a7ab173fabe7 | 3.0.0   | 45 kB   | true      | 4         | 16726      |
| 10.240.0.18:2379 | 59796ba9cd1bcd72 | 3.0.0   | 45 kB   | false     | 4         | 16726      |
| 10.240.0.19:2379 | 94df724b66343e6c | 3.0.0   | 45 kB   | false     | 4         | 16726      |
+------------------+------------------+---------+---------+-----------+-----------+------------+

etcdctl --endpoints=$ENDPOINTS endpoint health

10.240.0.17:2379 is healthy: successfully committed proposal: took = 3.345431ms
10.240.0.19:2379 is healthy: successfully committed proposal: took = 3.767967ms
10.240.0.18:2379 is healthy: successfully committed proposal: took = 4.025451ms
```

#### Cluster members

The member-related commands are:

```
member add      Adds a member into the cluster
member remove   Removes a member from the cluster
member update   Updates a member in the cluster
member list     Lists all members in the cluster
```

For example, `etcdctl member list` lists the cluster members:

```
etcdctl --endpoints=http://172.16.5.4:12379 member list -w table
+-----------------+---------+-------+------------------------+-----------------------------------------------+
|       ID        | STATUS  | NAME  |       PEER ADDRS       |                 CLIENT ADDRS                  |
+-----------------+---------+-------+------------------------+-----------------------------------------------+
| c856d92a82ba66a | started | etcd0 | http://172.16.5.4:2380 | http://172.16.5.4:2379,http://172.16.5.4:4001 |
+-----------------+---------+-------+------------------------+-----------------------------------------------+
```

## Backup and Restore

### Backup

```
# mkdir -p /tmp/backup/etcd/    # directory for the backup data
# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --endpoints="https://192.168.1.92:2379" snapshot save /tmp/backup/etcd/etcd-snapshot-`date +%Y%m%d`.db
```

### Backup script

```
#!/usr/bin/env bash

CACERT="/etc/kubernetes/pki/etcd/ca.crt"
CERT="/etc/kubernetes/pki/etcd/server.crt"
KEY="/etc/kubernetes/pki/etcd/server.key"
ENDPOINTS="https://192.168.1.92:2379"

ETCDCTL_API=3 etcdctl \
  --cacert="${CACERT}" --cert="${CERT}" --key="${KEY}" \
  --endpoints=${ENDPOINTS} \
  snapshot save /tmp/backup/etcd/etcd-snapshot-`date +%Y%m%d`.db

# keep backups for 30 days
find /tmp/backup/etcd -name "*.db" -mtime +30 -exec rm -f {} \;
```

### Restore

#### Single-node restore

The original cluster is a single node, and we want to restore from a backup file. The single test-environment etcd node was backed up above; now restore it onto another etcd node.

```
Install
[root@etcd1 ~]# wget https://mirrors.huaweicloud.com/etcd/v3.4.10/etcd-v3.4.10-linux-amd64.tar.gz
[root@etcd1 ~]# tar -xf etcd-v3.4.10-linux-amd64.tar.gz
[root@etcd1 ~]# cp /root/etcd-v3.4.10-linux-amd64/{etcd,etcdctl} /usr/bin/
[root@etcd1 ~]# systemctl disable --now firewalld

[root@etcd1 ~]# vim /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target

[Service]
Type=simple
WorkingDirectory=/data/etcd
EnvironmentFile=-/etc/etcd/etcd.conf
ExecStart=/usr/bin/etcd

[Install]
WantedBy=multi-user.target

[root@etcd1 ~]# mkdir -p /data/etcd
[root@etcd1 ~]# mkdir -p /etc/etcd/
[root@etcd1 ~]# vim /etc/etcd/etcd.conf
ETCD_NAME=default
ETCD_DATA_DIR="/data/etcd/default.etcd/"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://0.0.0.0:2379"

[root@etcd1 ~]# systemctl enable --now etcd

1. Stop the etcd service
[root@etcd1 ~]# systemctl stop etcd

2. Copy the backup file to the current directory

3. Back up the previous data files
[root@etcd1 ~]# mv /data/etcd/default.etcd{,.bak}

4. Restore
[root@etcd1 ~]# ETCDCTL_API=3 etcdctl snapshot restore etcd-snapshot-20210824.db --data-dir=/data/etcd/default.etcd
{"level":"info","ts":1629804137.2476945,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"etcd-snapshot-20210824.db","wal-dir":"/data/etcd/default.etcd/member/wal","data-dir":"/data/etcd/default.etcd","snap-dir":"/data/etcd/default.etcd/member/snap"}
{"level":"info","ts":1629804141.1843638,"caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":193110114}
{"level":"info","ts":1629804142.0297222,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1629804142.0416465,"caller":"snapshot/v3_snapshot.go:309","msg":"restored snapshot","path":"etcd-snapshot-20210824.db","wal-dir":"/data/etcd/default.etcd/member/wal","data-dir":"/data/etcd/default.etcd","snap-dir":"/data/etcd/default.etcd/member/snap"}

5. Start
[root@etcd1 ~]# systemctl start etcd

6. Verify
[root@etcd1 ~]# ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 get /registry/configmaps/kube-system/kube-flannel-cfg
/registry/configmaps/kube-system/kube-flannel-cfg
k8s v1 ConfigMap kube-flannel-cfg kube-system ... (raw protobuf-framed value; non-printable bytes omitted, the embedded payloads are readable)
cni-conf.json: { "name": "cbr0", "cniVersion": "0.3.1", "plugins": [ { "type": "flannel", "delegate": { "hairpinMode": true, "isDefaultGateway": true } }, { "type": "portmap", "capabilities": { "portMappings": true } } ] }
net-conf.json: { "Network": "10.244.0.0/16", "Backend": { "Type": "vxlan" } }
```

#### Migrating single-node data to a cluster

New requirement: a single-node etcd is not reliable, so the data of the single node above must be migrated into a cluster.

| Node   | IP             |
| ------ | -------------- |
| etcd-1 | 192.168.116.15 |
| etcd-2 | 192.168.116.16 |
| etcd-3 | 192.168.116.17 |

> Creating certificates

```
Use cfssl to generate self-signed certificates. First download the cfssl tools:
[root@etcd-1 ~]# wget https://pkg.cfssl.org/R1.2/cfssl_linux-amd64
[root@etcd-1 ~]# wget https://pkg.cfssl.org/R1.2/cfssljson_linux-amd64
[root@etcd-1 ~]# wget https://pkg.cfssl.org/R1.2/cfssl-certinfo_linux-amd64
[root@etcd-1 ~]# chmod +x cfssl_linux-amd64 cfssljson_linux-amd64 cfssl-certinfo_linux-amd64
[root@etcd-1 ~]# mv cfssl_linux-amd64 /usr/local/bin/cfssl
[root@etcd-1 ~]# mv cfssljson_linux-amd64 /usr/local/bin/cfssljson
[root@etcd-1 ~]# mv cfssl-certinfo_linux-amd64 /usr/local/bin/cfssl-certinfo

# Create the CA (Certificate Authority)
# Export the default config template
[root@etcd-1 ~]# mkdir ssl
[root@etcd-1 ~]# cd ssl/
[root@etcd-1 ssl]# cfssl print-defaults config > config.json
# Export the default certificate signing request (CSR) template
[root@etcd-1 ssl]# cfssl print-defaults csr > csr.json

# Fill in config.json, based on the exported template
[root@etcd-1 ssl]# cat config.json
{
    "signing": {
        "default": {
            "expiry": "438000h"
        },
        "profiles": {
            "server": {
                "expiry": "438000h",
                "usages": [
                    "signing",
                    "key encipherment",
                    "server auth",
                    "client auth"
                ]
            },
            "client": {
                "expiry": "438000h",
                "usages": [
                    "signing",
                    "key encipherment",
                    "client auth"
                ]
            },
            "peer": {
                "expiry": "438000h",
                "usages": [
                    "signing",
                    "key encipherment",
                    "server auth",
                    "client auth"
                ]
            }
        }
    }
}

# Create ca-csr.json, based on the csr.json template
[root@etcd-1 ssl]# cat ca-csr.json
{
    "CN": "etcd",
    "key": {
        "algo": "ecdsa",
        "size": 256
    }
}

# Generate the CA certificate and private key
[root@etcd-1 ssl]# cfssl gencert -initca ca-csr.json | cfssljson -bare ca
This produces ca-key.pem (private key) and ca.pem (certificate), the files needed to run the CA, plus ca.csr (certificate signing request) for cross-signing or re-signing.

# Create the client CSR file
[root@etcd-1 ssl]# cat client.json
{
    "CN": "client",
    "key": {
        "algo": "ecdsa",
        "size": 256
    }
}
[root@etcd-1 ssl]# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=config.json -profile=client client.json | cfssljson -bare client

# Create the etcd server and peer CSR file
[root@etcd-1 ssl]# cat etcd.json
{
    "CN": "etcd",
    "hosts": [
        "192.168.116.15",
        "192.168.116.16",
        "192.168.116.17",
        "192.168.116.18",
        "192.168.116.19",
        "192.168.116.20",
        "192.168.116.21",
        "192.168.116.22",
        "192.168.116.23",
        "192.168.116.24",
        "192.168.116.25"
    ],
    "key": {
        "algo": "ecdsa",
        "size": 256
    },
    "names": [
        {
            "C": "CN",
            "L": "GG",
            "ST": "SZ"
        }
    ]
}
The extra hosts are reserved nodes, kept for future scale-out.

# Generate the server certificate
[root@etcd-1 ssl]# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=config.json -profile=server etcd.json | cfssljson -bare server

# Generate the peer certificate
[root@etcd-1 ssl]# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=config.json -profile=peer etcd.json | cfssljson -bare peer
```

> Cluster installation

```
[root@etcd-1 ~]# wget https://mirrors.huaweicloud.com/etcd/v3.4.10/etcd-v3.4.10-linux-amd64.tar.gz
[root@etcd-1 ~]# scp etcd-v3.4.10-linux-amd64.tar.gz 192.168.116.16:
[root@etcd-1 ~]# scp etcd-v3.4.10-linux-amd64.tar.gz 192.168.116.17:

# Copy the certificate directory across
[root@etcd-1 ~]# scp -r ssl/ 192.168.116.16:
[root@etcd-1 ~]# scp -r ssl/ 192.168.116.17:

Unpack the tarball; same steps on every machine
[root@etcd-1 ~]# mkdir /opt/etcd/{bin,cfg,ssl,data} -p
[root@etcd-1 ~]# tar -xf etcd-v3.4.10-linux-amd64.tar.gz -C /opt/
[root@etcd-1 ~]# mv /opt/etcd-v3.4.10-linux-amd64/{etcd,etcdctl} /opt/etcd/bin/
[root@etcd-1 ~]# cp /root/ssl/*.pem /opt/etcd/ssl/
[root@etcd-1 ~]# systemctl disable --now firewalld

On etcd-1:
# Create etcd's environment file
[root@etcd-1 ~]# cat /opt/etcd/cfg/etcd.conf
NAME="etcd-1"
DATA_DIR="/opt/etcd/data"
LISTEN_PEER_URLS="https://192.168.116.15:2380"
LISTEN_CLIENT_URLS="https://192.168.116.15:2379"
INITIAL_ADVERTISE_PEER_URLS="https://192.168.116.15:2380"
ADVERTISE_CLIENT_URLS="https://192.168.116.15:2379"
INITIAL_CLUSTER="etcd-1=https://192.168.116.15:2380,etcd-2=https://192.168.116.16:2380"
INITIAL_CLUSTER_TOKEN="etcd-cluster"
INITIAL_CLUSTER_STATE="new"

# Manage etcd with systemd
[root@etcd-1 ~]# cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=/opt/etcd/cfg/etcd.conf
ExecStart=/opt/etcd/bin/etcd \
  --name=${NAME} \
  --data-dir=${DATA_DIR} \
  --listen-peer-urls=${LISTEN_PEER_URLS} \
  --listen-client-urls=${LISTEN_CLIENT_URLS},https://127.0.0.1:2379 \
  --advertise-client-urls=${ADVERTISE_CLIENT_URLS} \
  --initial-advertise-peer-urls=${INITIAL_ADVERTISE_PEER_URLS} \
  --initial-cluster=${INITIAL_CLUSTER} \
  --initial-cluster-token=${INITIAL_CLUSTER_TOKEN} \
  --initial-cluster-state=${INITIAL_CLUSTER_STATE} \
  --cert-file=/opt/etcd/ssl/server.pem \
  --key-file=/opt/etcd/ssl/server-key.pem \
  --peer-cert-file=/opt/etcd/ssl/peer.pem \
  --peer-key-file=/opt/etcd/ssl/peer-key.pem \
  --trusted-ca-file=/opt/etcd/ssl/ca.pem \
  --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

# The data directory must have mode 700
[root@etcd-1 ~]# chmod 700 /opt/etcd/data/

On etcd-2:
[root@etcd-2 ~]# cat /opt/etcd/cfg/etcd.conf
NAME="etcd-2"
DATA_DIR="/opt/etcd/data"
LISTEN_PEER_URLS="https://192.168.116.16:2380"
LISTEN_CLIENT_URLS="https://192.168.116.16:2379"
INITIAL_ADVERTISE_PEER_URLS="https://192.168.116.16:2380"
ADVERTISE_CLIENT_URLS="https://192.168.116.16:2379"
INITIAL_CLUSTER="etcd-1=https://192.168.116.15:2380,etcd-2=https://192.168.116.16:2380"
INITIAL_CLUSTER_TOKEN="etcd-cluster"
INITIAL_CLUSTER_STATE="new"

[root@etcd-2 ~]# cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=/opt/etcd/cfg/etcd.conf
ExecStart=/opt/etcd/bin/etcd \
  --name=${NAME} \
  --data-dir=${DATA_DIR} \
  --listen-peer-urls=${LISTEN_PEER_URLS} \
  --listen-client-urls=${LISTEN_CLIENT_URLS},https://127.0.0.1:2379 \
  --advertise-client-urls=${ADVERTISE_CLIENT_URLS} \
  --initial-advertise-peer-urls=${INITIAL_ADVERTISE_PEER_URLS} \
  --initial-cluster=${INITIAL_CLUSTER} \
  --initial-cluster-token=${INITIAL_CLUSTER_TOKEN} \
  --initial-cluster-state=${INITIAL_CLUSTER_STATE} \
  --cert-file=/opt/etcd/ssl/server.pem \
  --key-file=/opt/etcd/ssl/server-key.pem \
  --peer-cert-file=/opt/etcd/ssl/peer.pem \
  --peer-key-file=/opt/etcd/ssl/peer-key.pem \
  --trusted-ca-file=/opt/etcd/ssl/ca.pem \
  --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

[root@etcd-2 ~]# chmod 700 /opt/etcd/data/

Start etcd-1 and etcd-2 (run on both nodes)
[root@etcd-2 ~]# systemctl enable --now etcd

Check the cluster status
[root@etcd-1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379" -w table endpoint --cluster status
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.116.16:2379 | 38e677eb51b6e690 | 3.4.10  | 25 kB   | false     | false      | 78        | 7          | 7                  |        |
| https://192.168.116.15:2379 | b5779181c59c3700 | 3.4.10  | 25 kB   | true      | false      | 78        | 7          | 7                  |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Or:
[root@etcd-1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379" -w table endpoint status

Note: plain `endpoint status` lists only the nodes given in --endpoints, while `endpoint --cluster status` lists every node in the cluster
```

> Adding etcd-3 to the cluster

```
Add node etcd-3 to the cluster
[root@etcd-1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379" member add etcd-3 --peer-urls=https://192.168.116.17:2380
Member 4238c12f7fcf2ceb added to cluster 908202c1add782e4

ETCD_NAME="etcd-3"
ETCD_INITIAL_CLUSTER="etcd-2=https://192.168.116.16:2380,etcd-3=https://192.168.116.17:2380,etcd-1=https://192.168.116.15:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.116.17:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
# Note: the new node's etcd config file must include the values printed above

List the cluster members
[root@etcd-1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379" -w table member list
+------------------+-----------+--------+-----------------------------+-----------------------------+------------+
|        ID        |  STATUS   |  NAME  |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+-----------+--------+-----------------------------+-----------------------------+------------+
| 4da14f0c9981c64  | unstarted |        | https://192.168.116.17:2380 |                             | false      |
| 38e677eb51b6e690 | started   | etcd-2 | https://192.168.116.16:2380 | https://192.168.116.16:2379 | false      |
| b5779181c59c3700 | started   | etcd-1 | https://192.168.116.15:2380 | https://192.168.116.15:2379 | false      |
+------------------+-----------+--------+-----------------------------+-----------------------------+------------+

Install and configure etcd on etcd-3
[root@etcd-3 ~]# cat /opt/etcd/cfg/etcd.conf
DATA_DIR="/opt/etcd/data"
LISTEN_PEER_URLS="https://192.168.116.17:2380"
LISTEN_CLIENT_URLS="https://192.168.116.17:2379"
ADVERTISE_CLIENT_URLS="https://192.168.116.17:2379"
INITIAL_CLUSTER_TOKEN="etcd-cluster"
NAME="etcd-3"
INITIAL_CLUSTER="etcd-2=https://192.168.116.16:2380,etcd-3=https://192.168.116.17:2380,etcd-1=https://192.168.116.15:2380"
INITIAL_ADVERTISE_PEER_URLS="https://192.168.116.17:2380"
INITIAL_CLUSTER_STATE="existing"

[root@etcd-3 ~]# cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=/opt/etcd/cfg/etcd.conf
ExecStart=/opt/etcd/bin/etcd \
  --name=${NAME} \
  --data-dir=${DATA_DIR} \
  --listen-peer-urls=${LISTEN_PEER_URLS} \
  --listen-client-urls=${LISTEN_CLIENT_URLS},https://127.0.0.1:2379 \
  --advertise-client-urls=${ADVERTISE_CLIENT_URLS} \
  --initial-advertise-peer-urls=${INITIAL_ADVERTISE_PEER_URLS} \
  --initial-cluster=${INITIAL_CLUSTER} \
  --initial-cluster-token=${INITIAL_CLUSTER_TOKEN} \
  --initial-cluster-state=${INITIAL_CLUSTER_STATE} \
  --cert-file=/opt/etcd/ssl/server.pem \
  --key-file=/opt/etcd/ssl/server-key.pem \
  --peer-cert-file=/opt/etcd/ssl/peer.pem \
  --peer-key-file=/opt/etcd/ssl/peer-key.pem \
  --trusted-ca-file=/opt/etcd/ssl/ca.pem \
  --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

[root@etcd-3 ~]# chmod 700 /opt/etcd/data/

Start etcd on etcd-3
[root@etcd-3 ~]# systemctl enable --now etcd

# Listing the members and the cluster status again shows that etcd-3 has successfully joined the cluster
[root@etcd-1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379" -w table member list
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  |  NAME  |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+
| 1d1a009f14d49f5d | started | etcd-3 | https://192.168.116.17:2380 | https://192.168.116.17:2379 | false      |
| 38e677eb51b6e690 | started | etcd-2 | https://192.168.116.16:2380 | https://192.168.116.16:2379 | false      |
| b5779181c59c3700 | started | etcd-1 | https://192.168.116.15:2380 | https://192.168.116.15:2379 | false      |
+------------------+---------+--------+-----------------------------+-----------------------------+------------+

[root@etcd-1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379" -w table endpoint --cluster status
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.116.17:2379 | 1d1a009f14d49f5d | 3.4.10  | 25 kB   | false     | false      | 3         | 8          | 8                  |        |
| https://192.168.116.16:2379 | 38e677eb51b6e690 | 3.4.10  | 25 kB   | false     | false      | 3         | 8          | 8                  |        |
| https://192.168.116.15:2379 | b5779181c59c3700 | 3.4.10  | 20 kB   | true      | false      | 3         | 8          | 8                  |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```

> Data migration

```
Run the following on any one of the cluster nodes to mirror the data from the single node into the cluster
[root@etcd-1 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl make-mirror --endpoints="http://192.168.116.10:2379" https://192.168.116.15:2379 --dest-cacert=/opt/etcd/ssl/ca.pem --dest-cert=/opt/etcd/ssl/client.pem --dest-key=/opt/etcd/ssl/client-key.pem

Verify after the migration
For example, list all keys; there are too many, so show only the first 20 lines
[root@etcd-2 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" get / --prefix --keys-only | head -20
/registry/apiregistration.k8s.io/apiservices/v1.
/registry/apiregistration.k8s.io/apiservices/v1.apps
/registry/apiregistration.k8s.io/apiservices/v1.authentication.k8s.io
/registry/apiregistration.k8s.io/apiservices/v1.authorization.k8s.io
/registry/apiregistration.k8s.io/apiservices/v1.autoscaling
/registry/apiregistration.k8s.io/apiservices/v1.batch
/registry/apiregistration.k8s.io/apiservices/v1.coordination.k8s.io
/registry/apiregistration.k8s.io/apiservices/v1.networking.k8s.io
/registry/apiregistration.k8s.io/apiservices/v1.rbac.authorization.k8s.io
/registry/apiregistration.k8s.io/apiservices/v1.scheduling.k8s.io

Note: make-mirror does not stop on its own; every 30 s it prints the number of keys synchronized so far. When that number stops changing for a while, the sync is complete.
```

#### Cluster backup and restore

> Backup

```
Backing up from any one node in the cluster is enough
[root@etcd-1 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" snapshot save etcd-snapshot-`date +%Y%m%d`.db
```

> Restore

```
Stop the etcd service on every node in the cluster
[root@etcd-1 ~]# systemctl stop etcd

Back up the previous data directory on every node
[root@etcd-1 ~]# mv /opt/etcd/data{,.bak}

Prepare the restore script
[root@etcd-1 ~]# cat restore.sh
#!/bin/bash
count=1
for ip in 192.168.116.{15..17}
do
    ETCDCTL_API=3 /opt/etcd/bin/etcdctl snapshot restore /root/etcd-snapshot-20210824.db \
      --name etcd-${count} \
      --initial-cluster "etcd-1=https://192.168.116.15:2380,etcd-2=https://192.168.116.16:2380,etcd-3=https://192.168.116.17:2380" \
      --initial-cluster-token etcd-cluster \
      --initial-advertise-peer-urls https://${ip}:2380 \
      --data-dir=/root/node${count}
    # The trailing / on /root/node${count}/ matters: without it rsync copies the
    # node${count} directory itself into /opt/etcd/data; with it, only the member
    # directory underneath is synced into /opt/etcd/data
    rsync -av --delete /root/node${count}/ $ip:/opt/etcd/data
    let count++
done

Run the script to restore
[root@etcd-1 ~]# sh restore.sh

Set all data directories back to mode 700
[root@etcd-1 ~]# chmod 700 /opt/etcd/data

Start etcd node by node
Note: the config files need updating first — the cluster originally had only two members, so etcd-3 must now be added to every config file

On etcd-1
[root@etcd-1 ~]# cat /opt/etcd/cfg/etcd.conf
NAME="etcd-1"
DATA_DIR="/opt/etcd/data"
LISTEN_PEER_URLS="https://192.168.116.15:2380"
LISTEN_CLIENT_URLS="https://192.168.116.15:2379"
INITIAL_ADVERTISE_PEER_URLS="https://192.168.116.15:2380"
ADVERTISE_CLIENT_URLS="https://192.168.116.15:2379"
INITIAL_CLUSTER="etcd-1=https://192.168.116.15:2380,etcd-2=https://192.168.116.16:2380,etcd-3=https://192.168.116.17:2380"
INITIAL_CLUSTER_TOKEN="etcd-cluster"
INITIAL_CLUSTER_STATE="new"

On etcd-2
[root@etcd-2 ~]# cat /opt/etcd/cfg/etcd.conf
NAME="etcd-2"
DATA_DIR="/opt/etcd/data"
LISTEN_PEER_URLS="https://192.168.116.16:2380"
LISTEN_CLIENT_URLS="https://192.168.116.16:2379"
INITIAL_ADVERTISE_PEER_URLS="https://192.168.116.16:2380"
ADVERTISE_CLIENT_URLS="https://192.168.116.16:2379"
INITIAL_CLUSTER="etcd-1=https://192.168.116.15:2380,etcd-2=https://192.168.116.16:2380,etcd-3=https://192.168.116.17:2380"
INITIAL_CLUSTER_TOKEN="etcd-cluster"
INITIAL_CLUSTER_STATE="new"

On etcd-3
[root@etcd-3 ~]# cat /opt/etcd/cfg/etcd.conf
DATA_DIR="/opt/etcd/data"
LISTEN_PEER_URLS="https://192.168.116.17:2380"
LISTEN_CLIENT_URLS="https://192.168.116.17:2379"
ADVERTISE_CLIENT_URLS="https://192.168.116.17:2379"
INITIAL_CLUSTER_TOKEN="etcd-cluster"
NAME="etcd-3"
INITIAL_CLUSTER="etcd-2=https://192.168.116.16:2380,etcd-3=https://192.168.116.17:2380,etcd-1=https://192.168.116.15:2380"
INITIAL_ADVERTISE_PEER_URLS="https://192.168.116.17:2380"
INITIAL_CLUSTER_STATE="new"

Start them one by one
[root@etcd-1 ~]# systemctl start etcd

# Verify
[root@etcd-1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" -w table endpoint status
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.116.15:2379 | b5779181c59c3700 | 3.4.10  | 25 kB   | false     | false      | 41        | 9          | 9                  |        |
| https://192.168.116.16:2379 | 38e677eb51b6e690 | 3.4.10  | 20 kB   | false     | false      | 41        | 9          | 9                  |        |
| https://192.168.116.17:2379 | 630a75ff7591bbf7 | 3.4.10  | 25 kB   | true      | false      | 41        | 9          | 9                  |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Verify that the data came back
[root@etcd-2 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" get / --prefix --keys-only | egrep -v '^$' | wc -l
24172

The data has been restored successfully.
```

#### Kubernetes etcd restore order

Stop kube-apiserver --> stop etcd --> restore the data --> start etcd --> start kube-apiserver

## Compacting History

If the keyspace is never compacted, its history keeps growing; once the backend hits its size threshold, the cluster enters a state where only reads and deletes are possible and writes are rejected. Periodically compacting the cluster's history is therefore essential.
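Manual compaction takes a revision number as its argument, and one way to pick it is to parse the current revision out of `etcdctl endpoint status -w json`. The sketch below hard-codes a sample JSON payload (hypothetical values) in place of live command output, so it runs without a cluster:

```shell
# Extract the current revision from `endpoint status` JSON output.
# status_json is a hard-coded sample (hypothetical values); on a live
# cluster, substitute the real output of:
#   ETCDCTL_API=3 etcdctl --endpoints=$ENDPOINTS endpoint status -w json
status_json='[{"Endpoint":"https://192.168.116.15:2379","Status":{"header":{"revision":193110120}}}]'
rev=$(printf '%s' "$status_json" | sed -n 's/.*"revision":\([0-9]*\).*/\1/p')
echo "current revision: $rev"   # prints: current revision: 193110120
```

That revision can then be fed to `etcdctl compaction` (with the usual certificate and endpoint flags) to drop all history older than it.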
Compaction does not delete current data; it only cleans up historical versions of keys. After compaction, those historical revisions can no longer be read, but access to the latest data is unaffected.

### Manual compaction

```
Compact away the history before revision 10
[root@etcd-2 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" compaction 10

The current revision can be read from any key's JSON output
[root@etcd-2 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" get /registry/configmaps/kube-system/kube-flannel-cfg -w json

# Reading data from before revision 10 now reports that it no longer exists
[root@etcd-2 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" get /registry/configmaps/kube-system/kube-flannel-cfg --rev=9
{"level":"warn","ts":"2021-08-24T23:02:30.151+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-26a4a8df-de68-459d-9999-2ef8a949b43e/192.168.116.15:2379","attempt":0,"error":"rpc error: code = OutOfRange desc = etcdserver: mvcc: required revision has been compacted"}
Error: etcdserver: mvcc: required revision has been compacted
```

### Automatic compaction

Starting etcd with `--auto-compaction-retention=1` makes it compact automatically, keeping one hour of history.

### Defragmentation

After a compaction, the compacted old revisions leave internal fragmentation behind: disk space that is free and available to the backend, but still occupied on disk. Defragmentation releases that space back to the file system.

```
[root@etcd-2 ~]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem --endpoints="https://192.168.116.15:2379,https://192.168.116.16:2379,https://192.168.116.17:2379" defrag
```
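Defragmentation runs per member, and each member blocks while it rewrites its backend database, so on a live cluster it is gentler to walk the endpoints one at a time than to pass them all at once. A minimal sketch reusing the endpoints and certificate paths from the cluster above — it only builds the per-member command lines (so it runs anywhere); pipe its output to `sh` on a machine that can actually reach the cluster:

```shell
#!/usr/bin/env bash
# Build one defrag command per member; running them serially keeps the
# rest of the cluster serving while each member rewrites its backend.
certs="--cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/client.pem --key=/opt/etcd/ssl/client-key.pem"
cmds=""
for ep in https://192.168.116.15:2379 https://192.168.116.16:2379 https://192.168.116.17:2379; do
  cmds="${cmds}ETCDCTL_API=3 /opt/etcd/bin/etcdctl $certs --endpoints=$ep defrag
"
done
printf '%s' "$cmds"
```

Running the emitted commands one after another defragments the whole cluster without taking more than one member offline at a time.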