etcd Disaster Recovery

etcd disaster recovery relies on a snapshot file and is essentially a snapshot-restore operation. When the majority of an etcd cluster's nodes are permanently lost, or cannot be brought back to normal operation within a short time, restoring the etcd service requires performing a snapshot restore. Note that data recovery for v2 and v3 are two entirely different things; this article focuses on backup and recovery of v3 data.

Lab Environment

The lab environment left over from the previous article:

1161d5b4260241e3, started, lv-etcd-research-alpha-1, http://192.168.149.63:2380, http://192.168.149.63:2379
2145c204a51dbbc7, started, lv-etcd-research-alpha-0, http://192.168.149.60:2380, http://192.168.149.60:2379
4252aec339d438d9, started, lv-etcd-research-alpha-3, http://192.168.149.62:2380, http://192.168.149.62:2379
e26482910894af8d, started, lv-etcd-research-alpha-2, http://192.168.149.61:2380, http://192.168.149.61:2379
ea04db3353b9fd4e, started, lv-etcd-research-alpha-4, http://192.168.149.64:2380, http://192.168.149.64:2379

Building on that environment, this article uses a backed-up snapshot file to create a new cluster.

Overall Steps

  • Create a snapshot file
  • Distribute the snapshot file to every host of the new cluster
  • Run the etcdctl snapshot restore command to restore the data into a new data directory, stamped with the metadata of the new (temporary logical) cluster
  • Start the etcd service using the new data directory

Creating a Snapshot File

Assume that of the 5 nodes in the cluster only 192.168.149.60 is still alive. The first thing to do is export the data (take a snapshot) on the surviving node.

Of course, under normal circumstances a production environment takes periodic snapshot backups of etcd; in that case, simply restore from the most recent snapshot.

> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379 snapshot save snapshot.db
Snapshot saved at snapshot.db
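
Before distributing the file it is worth confirming that it is a readable v3 snapshot, and for the periodic production backups mentioned above a scheduled job is the usual approach. A minimal sketch, assuming the same etcdctl version used in this experiment; the backup path and cron schedule below are placeholders, not part of the original setup:

> ETCDCTL_API=3 etcdctl snapshot status snapshot.db -w table

# hypothetical cron entry for periodic backups; /data/backup/etcd is a placeholder path
0 */6 * * * ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379 snapshot save /data/backup/etcd/snapshot-$(date +\%Y\%m\%d\%H\%M).db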

Distributing the Snapshot File

The second step is to distribute the snapshot file to the machines of the new cluster. In my current environment, that means sending the snapshot file to:

  • 192.168.149.61
  • 192.168.149.62
  • 192.168.149.63
  • 192.168.149.64

In a real-world scenario the other four machines, or even all of the hosts of the original cluster, may be unusable. In that case, download the previously backed-up snapshot file from the backup server onto every host that will make up the new cluster.

> scp snapshot.db root@192.168.149.61:/var/lib/etcd/
snapshot.db 100% 20MB 42.1MB/s 00:00
> scp snapshot.db root@192.168.149.62:/var/lib/etcd/
snapshot.db 100% 20MB 30.3MB/s 00:00
> scp snapshot.db root@192.168.149.63:/var/lib/etcd/
snapshot.db 100% 20MB 36.9MB/s 00:00
> scp snapshot.db root@192.168.149.64:/var/lib/etcd/
snapshot.db
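
An optional sanity check, not part of the original procedure: compare a checksum of each copy so a truncated transfer does not surface later as a failed restore.

> sha256sum snapshot.db
# run the same command on 192.168.149.61-64 and confirm the hashes match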

Restoring the Snapshot

Run on 192.168.149.60

cd /var/lib/etcd
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
--data-dir="/var/lib/etcd/new.etcd" \
--name="lv-etcd-research-beta-0" \
--initial-advertise-peer-urls="http://192.168.149.60:22380" \
--initial-cluster="lv-etcd-research-beta-0=http://192.168.149.60:22380,lv-etcd-research-beta-1=http://192.168.149.63:22380,lv-etcd-research-beta-2=http://192.168.149.61:22380,lv-etcd-research-beta-3=http://192.168.149.62:22380,lv-etcd-research-beta-4=http://192.168.149.64:22380" \
--initial-cluster-token="lv-etcd-research-beta-temp"

Output:

2019-06-14 14:47:34.172213 I | etcdserver/membership: added member 6914761fd26729d7 [http://192.168.149.62:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:47:34.172370 I | etcdserver/membership: added member b8ca704ce48fc6c2 [http://192.168.149.63:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:47:34.172419 I | etcdserver/membership: added member bff8d73529095f70 [http://192.168.149.64:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:47:34.172464 I | etcdserver/membership: added member eb548d413adb4560 [http://192.168.149.61:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:47:34.172506 I | etcdserver/membership: added member fee07bdb23e26b2f [http://192.168.149.60:22380] to cluster c1cdf0b2061f8dcc
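
Optionally confirm what the restore produced before moving on; the restored data directory should contain the usual member/snap and member/wal layout of an etcd v3 data dir.

> ls /var/lib/etcd/new.etcd/member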

Run on 192.168.149.63

cd /var/lib/etcd
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
--data-dir="/var/lib/etcd/new.etcd" \
--name="lv-etcd-research-beta-1" \
--initial-advertise-peer-urls="http://192.168.149.63:22380" \
--initial-cluster="lv-etcd-research-beta-0=http://192.168.149.60:22380,lv-etcd-research-beta-1=http://192.168.149.63:22380,lv-etcd-research-beta-2=http://192.168.149.61:22380,lv-etcd-research-beta-3=http://192.168.149.62:22380,lv-etcd-research-beta-4=http://192.168.149.64:22380" \
--initial-cluster-token="lv-etcd-research-beta-temp"

Output:

2019-06-14 14:49:23.896672 I | etcdserver/membership: added member 6914761fd26729d7 [http://192.168.149.62:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:49:23.897112 I | etcdserver/membership: added member b8ca704ce48fc6c2 [http://192.168.149.63:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:49:23.897210 I | etcdserver/membership: added member bff8d73529095f70 [http://192.168.149.64:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:49:23.897264 I | etcdserver/membership: added member eb548d413adb4560 [http://192.168.149.61:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:49:23.897403 I | etcdserver/membership: added member fee07bdb23e26b2f [http://192.168.149.60:22380] to cluster c1cdf0b2061f8dcc

Run on 192.168.149.61

cd /var/lib/etcd
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
--data-dir="/var/lib/etcd/new.etcd" \
--name="lv-etcd-research-beta-2" \
--initial-advertise-peer-urls="http://192.168.149.61:22380" \
--initial-cluster="lv-etcd-research-beta-0=http://192.168.149.60:22380,lv-etcd-research-beta-1=http://192.168.149.63:22380,lv-etcd-research-beta-2=http://192.168.149.61:22380,lv-etcd-research-beta-3=http://192.168.149.62:22380,lv-etcd-research-beta-4=http://192.168.149.64:22380" \
--initial-cluster-token="lv-etcd-research-beta-temp"

The output is the same as above.


Run on 192.168.149.62

cd /var/lib/etcd
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
--data-dir="/var/lib/etcd/new.etcd" \
--name="lv-etcd-research-beta-3" \
--initial-advertise-peer-urls="http://192.168.149.62:22380" \
--initial-cluster="lv-etcd-research-beta-0=http://192.168.149.60:22380,lv-etcd-research-beta-1=http://192.168.149.63:22380,lv-etcd-research-beta-2=http://192.168.149.61:22380,lv-etcd-research-beta-3=http://192.168.149.62:22380,lv-etcd-research-beta-4=http://192.168.149.64:22380" \
--initial-cluster-token="lv-etcd-research-beta-temp"

The output is the same as above.


Run on 192.168.149.64

cd /var/lib/etcd
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
--data-dir="/var/lib/etcd/new.etcd" \
--name="lv-etcd-research-beta-4" \
--initial-advertise-peer-urls="http://192.168.149.64:22380" \
--initial-cluster="lv-etcd-research-beta-0=http://192.168.149.60:22380,lv-etcd-research-beta-1=http://192.168.149.63:22380,lv-etcd-research-beta-2=http://192.168.149.61:22380,lv-etcd-research-beta-3=http://192.168.149.62:22380,lv-etcd-research-beta-4=http://192.168.149.64:22380" \
--initial-cluster-token="lv-etcd-research-beta-temp"

The output is the same as above.

This step writes the data from the snapshot into the specified directory, together with the metadata of the new cluster. Note that the data in the snapshot is clean: it does not contain metadata such as the original node ID or cluster ID. After the restore, the cluster information is determined by the flags passed to the command, so every node only needs to be started with the new data directory; the cluster information does not have to be specified again, because it has already been written into the db.

Starting the New Cluster

Modify the etcd configuration file to point at the new data directory and start the service. In my lab environment the old etcd cluster is still running normally, so to verify the recovery I start a second, new cluster on the same 5 machines by using different ports.

Run on 192.168.149.60

# Create a new configuration file
echo 'ETCD_DATA_DIR="/var/lib/etcd/new.etcd"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:22380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:22379"
ETCD_NAME="lv-etcd-research-beta-0"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.149.60:22379"
ETCD_INITIAL_CLUSTER_TOKEN="lv-etcd-research-beta"' > /etc/etcd/etcd_new.conf

# Copy the existing systemd unit file
cp /usr/lib/systemd/system/etcd.service /usr/lib/systemd/system/etcd_new.service
# Point the new unit file at the new configuration file
sed -i s/etcd.conf/etcd_new.conf/g /usr/lib/systemd/system/etcd_new.service

systemctl daemon-reload
systemctl start etcd_new

The other four machines follow the same pattern; a parameterized sketch is shown below.
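
A hedged sketch of what "the same pattern" looks like on each remaining host. NODE_NAME and NODE_IP are placeholders to be set per host; the values only mirror the names and addresses used in the restore commands above.

# run on each remaining host, after setting these two variables for that host
NODE_NAME="lv-etcd-research-beta-1"   # beta-1 .. beta-4
NODE_IP="192.168.149.63"

echo "ETCD_DATA_DIR=\"/var/lib/etcd/new.etcd\"
ETCD_LISTEN_PEER_URLS=\"http://0.0.0.0:22380\"
ETCD_LISTEN_CLIENT_URLS=\"http://0.0.0.0:22379\"
ETCD_NAME=\"${NODE_NAME}\"
ETCD_ADVERTISE_CLIENT_URLS=\"http://${NODE_IP}:22379\"
ETCD_INITIAL_CLUSTER_TOKEN=\"lv-etcd-research-beta\"" > /etc/etcd/etcd_new.conf

cp /usr/lib/systemd/system/etcd.service /usr/lib/systemd/system/etcd_new.service
sed -i s/etcd.conf/etcd_new.conf/g /usr/lib/systemd/system/etcd_new.service
systemctl daemon-reload && systemctl start etcd_new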

Checking the Members of the Old and New Clusters

> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.63:2379 member list -w table
+------------------+---------+--------------------------+----------------------------+----------------------------+
|        ID        | STATUS  |           NAME           |         PEER ADDRS         |        CLIENT ADDRS        |
+------------------+---------+--------------------------+----------------------------+----------------------------+
| 1161d5b4260241e3 | started | lv-etcd-research-alpha-1 | http://192.168.149.63:2380 | http://192.168.149.63:2379 |
| 2145c204a51dbbc7 | started | lv-etcd-research-alpha-0 | http://192.168.149.60:2380 | http://192.168.149.60:2379 |
| 4252aec339d438d9 | started | lv-etcd-research-alpha-3 | http://192.168.149.62:2380 | http://192.168.149.62:2379 |
| e26482910894af8d | started | lv-etcd-research-alpha-2 | http://192.168.149.61:2380 | http://192.168.149.61:2379 |
| ea04db3353b9fd4e | started | lv-etcd-research-alpha-4 | http://192.168.149.64:2380 | http://192.168.149.64:2379 |
+------------------+---------+--------------------------+----------------------------+----------------------------+

> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.63:22379 member list -w table
+------------------+---------+-------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |          NAME           |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-------------------------+-----------------------------+-----------------------------+
| 6914761fd26729d7 | started | lv-etcd-research-beta-3 | http://192.168.149.62:22380 | http://192.168.149.63:22379 |
| b8ca704ce48fc6c2 | started | lv-etcd-research-beta-1 | http://192.168.149.63:22380 | http://192.168.149.63:22379 |
| bff8d73529095f70 | started | lv-etcd-research-beta-4 | http://192.168.149.64:22380 | http://192.168.149.64:22379 |
| eb548d413adb4560 | started | lv-etcd-research-beta-2 | http://192.168.149.61:22380 | http://192.168.149.61:22379 |
| fee07bdb23e26b2f | started | lv-etcd-research-beta-0 | http://192.168.149.60:22380 | http://192.168.149.60:22379 |
+------------------+---------+-------------------------+-----------------------------+-----------------------------+
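
As an extra, optional verification that the restored cluster is healthy and actually carries the old data, the standard etcdctl checks can be run against the new client port; the second command simply lists whatever keys the snapshot carried over.

> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:22379,http://192.168.149.61:22379,http://192.168.149.62:22379,http://192.168.149.63:22379,http://192.168.149.64:22379 endpoint health
> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:22379 get "" --prefix --keys-only | head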

Note the Differences Between v2 and v3

Upstream does not recommend mixing v2 and v3 usage. In other words, if you have both v2 and v3 storage needs, it is best to isolate them in two separate clusters. Backup and restore is a good illustration of why v2 and v3 should not be mixed: everything above applies only to v3 backup and recovery. Even if the backed-up etcd cluster contains v2 data, after restoring with this procedure the v2 data will not appear in the new cluster.

If you need to back up and restore v2 data, refer to the official documentation:

https://etcd.io/docs/v2/admin_guide/#disaster-recovery

The rough steps are as follows (a hedged command sketch follows the list):

  • Use the etcdctl backup command to back up the data into a new directory
  • Start a single-node etcd service from that new directory with --force-new-cluster
  • If what you are restoring is a cluster, first run ETCDCTL_API=2 etcdctl member update to update the recovered member
  • Finally, add the remaining nodes using the normal runtime configuration
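
A minimal sketch of that v2 flow on a single surviving node, assuming the default data directory; the member ID and peer address in the last command are placeholders taken from your own environment.

ETCDCTL_API=2 etcdctl backup \
  --data-dir /var/lib/etcd/default.etcd \
  --backup-dir /var/lib/etcd_backup

# start a one-member cluster from the backup; --force-new-cluster discards the old membership
etcd --data-dir /var/lib/etcd_backup --force-new-cluster

# fix the advertised peer URL of the recovered member, then add the other members as usual
ETCDCTL_API=2 etcdctl member update <member-id> http://<new-peer-ip>:2380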