etcd Node Migration

There are two ways to change the nodes of an etcd cluster: one is to migrate a node's data, the other is to add a new member and, once it has synced the data, remove the old one. This article covers the former: migrating an etcd node by moving its data directory.

Lab environment

Current etcd cluster information

> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 member list
1161d5b4260241e3, started, lv-etcd-research-alpha-1, http://192.168.149.60:2380, http://192.168.149.60:2379
4252aec339d438d9, started, lv-etcd-research-alpha-3, http://192.168.149.62:2380, http://192.168.149.62:2379
e6f45ed7d9402b75, started, lv-etcd-research-alpha-2, http://192.168.149.61:2380, http://192.168.149.61:2379
> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 endpoint status -w table
+----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------------------+------------------+---------+---------+-----------+-----------+------------+
| http://192.168.149.60:2379 | 1161d5b4260241e3 |  3.2.28 |   18 MB |     false |         7 |     124802 |
| http://192.168.149.62:2379 | 4252aec339d438d9 |  3.2.28 |   18 MB |     false |         7 |     124802 |
| http://192.168.149.61:2379 | e6f45ed7d9402b75 |  3.2.28 |   18 MB |      true |         7 |     124802 |
+----------------------------+------------------+---------+---------+-----------+-----------+------------+

The goal this time is to migrate the 192.168.149.60 node to 192.168.149.63.

Overall migration steps

  • Stop the etcd service on 192.168.149.60 first. If the process has already died, that saves you the trouble of stopping it 😆 provided you can guarantee it is well and truly dead and will not come back to life halfway through the migration…
  • Copy the data from the old machine to the corresponding directory on the new machine
  • On any node, run a member update to change the member's peerURLs to the new machine's IP:Port
  • Copy the configuration file from the old machine to the new one, change the IP addresses to the new machine's, make sure it points to the correct data directory, and start the service

Step 1: Stop the service

On the node being migrated, kill the etcd process. If circumstances allow, avoid -9; prefer a graceful shutdown.

systemctl stop etcd
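
If etcd on this host is not managed by systemd, a plain kill with the default SIGTERM (not -9/SIGKILL) achieves the same graceful shutdown. A minimal sketch, assuming a single etcd process is running on the box:

# Send SIGTERM so etcd can shut down cleanly; only fall back to -9 if it refuses to exit
> kill $(pgrep -x etcd)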

Query the cluster status at this point:

> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 endpoint status -w table
Failed to get the status of endpoint http://192.168.149.60:2379 (context deadline exceeded)
+----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------------------+------------------+---------+---------+-----------+-----------+------------+
| http://192.168.149.62:2379 | 4252aec339d438d9 |  3.2.28 |   18 MB |     false |         7 |     124940 |
| http://192.168.149.61:2379 | e6f45ed7d9402b75 |  3.2.28 |   18 MB |      true |         7 |     124940 |
+----------------------------+------------------+---------+---------+-----------+-----------+------------+

The 192.168.149.60 node is now unreachable.

Step 2: Migrate the data directory

On 192.168.149.63 (pre-create the data directory):

> mkdir -p /var/lib/etcd/default.etcd

On 192.168.149.60, from inside its etcd data directory (the one that contains the member directory), package the data and send it over:

> tar -cvzf member.tar.gz member
> scp member.tar.gz root@192.168.149.63:/var/lib/etcd/default.etcd/

On 192.168.149.63 (extract):

> cd /var/lib/etcd/default.etcd/
> tar -xvzf member.tar.gz
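
Depending on how etcd was installed, the extracted files may end up owned by root. A small cleanup sketch, assuming the service runs as a dedicated etcd user (adjust the owner to match your installation):

# Make sure the etcd service user can read and write the restored data, then tidy up
> chown -R etcd:etcd /var/lib/etcd/default.etcd
> rm member.tar.gz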

Step 3: Update the member information

Run the following on any one node to update the old member's peerURLs:

> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.61:2379 member update 1161d5b4260241e3 --peer-urls="http://192.168.149.63:2380"
Member 1161d5b4260241e3 updated in cluster 2c25150e88501a13

--endpoints http://192.168.149.61:2379: because the 192.168.149.60 node has been stopped, you have to pick one of the other endpoints to operate on the cluster.

1161d5b4260241e3 is the member ID of the 192.168.149.60 node. If you have forgotten it, run member list to look it up.
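
If you need to look the ID up again, it is the first field of the member list output; for example, filtering by the old node's IP:

> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.61:2379 member list | grep 192.168.149.60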

The output shows the command executed correctly. Querying the members now gives:

> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 member list -w table
+------------------+---------+--------------------------+----------------------------+----------------------------+
|        ID        | STATUS  |           NAME           |         PEER ADDRS         |        CLIENT ADDRS        |
+------------------+---------+--------------------------+----------------------------+----------------------------+
| 1161d5b4260241e3 | started | lv-etcd-research-alpha-1 | http://192.168.149.63:2380 | http://192.168.149.60:2379 |
| 4252aec339d438d9 | started | lv-etcd-research-alpha-3 | http://192.168.149.62:2380 | http://192.168.149.62:2379 |
| e6f45ed7d9402b75 | started | lv-etcd-research-alpha-2 | http://192.168.149.61:2380 | http://192.168.149.61:2379 |
+------------------+---------+--------------------------+----------------------------+----------------------------+

In the first row you can see that PEER ADDRS has been correctly updated to http://192.168.149.63:2380, while CLIENT ADDRS is still the old http://192.168.149.60:2379. Don't worry about this; once the new node starts, it will change to the correct address.

Step 4: Start the service on the new node

Before starting the service on the new node, remember to copy the configuration file over from the old node. After copying, be sure to adjust the following parameters.

# If the following two parameters are set to 0.0.0.0, no change is needed; if they pin exact IP addresses, change the .60 address to .63
ETCD_LISTEN_PEER_URLS
ETCD_LISTEN_CLIENT_URLS

# Change the IP address in the following two parameters to the new machine's IP
ETCD_INITIAL_ADVERTISE_PEER_URLS
ETCD_ADVERTISE_CLIENT_URLS

# In the cluster definition below, remember to change the old IP to the new machine's IP as well
ETCD_INITIAL_CLUSTER

Of the parameters above, ETCD_LISTEN_PEER_URLS, ETCD_LISTEN_CLIENT_URLS and ETCD_ADVERTISE_CLIENT_URLS are the most important ones; they must match the new machine's IP address.

Because this is an etcd runtime reconfiguration, the other two INITIAL parameters no longer have any effect when the service starts, but to avoid confusion when reading the configuration file later on, it is best to update them to the new machine's IP address as well.

Likewise, the ETCD_INITIAL_CLUSTER_STATE="new" parameter can be left in place, since it has no effect.
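
For reference, a minimal sketch of what the updated configuration on 192.168.149.63 could look like. The file path and the exact layout are assumptions based on a typical etcd environment file; the member name and data directory are taken from the cluster output above, and everything else should be adapted to your own setup:

# /etc/etcd/etcd.conf on 192.168.149.63 (illustrative sketch, not the original file)
ETCD_NAME="lv-etcd-research-alpha-1"                            # member name stays the same
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"                      # must point at the migrated data
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"                     # 0.0.0.0 needs no change
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.149.63:2380"   # new machine's IP
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.149.63:2379"         # new machine's IP
ETCD_INITIAL_CLUSTER="lv-etcd-research-alpha-1=http://192.168.149.63:2380,lv-etcd-research-alpha-2=http://192.168.149.61:2380,lv-etcd-research-alpha-3=http://192.168.149.62:2380"
ETCD_INITIAL_CLUSTER_STATE="new"                                # ignored, the data dir already exists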

Start the etcd service:

> systemctl start etcd
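
Before checking the whole cluster, it can be worth a quick local check that the process came up and joined without errors:

> systemctl status etcd
> journalctl -u etcd -n 50 --no-pager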

Cluster status:

> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.63:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 endpoint status -w table
+----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------------------+------------------+---------+---------+-----------+-----------+------------+
| http://192.168.149.63:2379 | 1161d5b4260241e3 |  3.2.28 |   18 MB |     false |         7 |     128011 |
| http://192.168.149.62:2379 | 4252aec339d438d9 |  3.2.28 |   18 MB |     false |         7 |     128011 |
| http://192.168.149.61:2379 | e6f45ed7d9402b75 |  3.2.28 |   18 MB |      true |         7 |     128011 |
+----------------------------+------------------+---------+---------+-----------+-----------+------------+

> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.63:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 member list -w table
+------------------+---------+--------------------------+----------------------------+----------------------------+
|        ID        | STATUS  |           NAME           |         PEER ADDRS         |        CLIENT ADDRS        |
+------------------+---------+--------------------------+----------------------------+----------------------------+
| 1161d5b4260241e3 | started | lv-etcd-research-alpha-1 | http://192.168.149.63:2380 | http://192.168.149.63:2379 |
| 4252aec339d438d9 | started | lv-etcd-research-alpha-3 | http://192.168.149.62:2380 | http://192.168.149.62:2379 |
| e6f45ed7d9402b75 | started | lv-etcd-research-alpha-2 | http://192.168.149.61:2380 | http://192.168.149.61:2379 |
+------------------+---------+--------------------------+----------------------------+----------------------------+

You can see that the RAFT INDEX of the "new" cluster is consistent across all members, which means the new node 192.168.149.63 has caught up with the cluster data.

PEER ADDRS and CLIENT ADDRS also both show the correct addresses.
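
As a final sanity check, each endpoint can also be asked to report its health:

> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.63:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 endpoint health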

At this point, the node migration is complete.

Summary:

This migration approach basically only works when you can still log in to the node being migrated and can still access its data directory. If the machine is already dead and the original data cannot be reached, this approach is not suitable. Migration, after all, assumes everything is still working; migrating something that isn't is called disaster recovery 😆