作为群集的管理员,除其他任务外,您应该能够通过添加(或删除)新节点来还原故障节点并扩展(或收缩)群集。

在MySQL中,如果您的数据量不太大,可以使用mysqldump客户端执行逻辑备份是一种不错的选择。它是复制原始模式对象和数据的SQL语句。

但是,对于大量数据, 需要选择诸如MySQL Enterprise Backup之类的物理备份解决方案,速度更快,尤其是对于还原操作而言。

前提

  • MySQL版本: 8.0.16
  • 已经配置好 InnoDB集群
  • 假设你已经了解了MySQL组复制和MySQL InnoDB集群的相关概念

备注:

根据不用的MySQL InnoDB集群配置,某些操作可能会略有不同

节点恢复

  • MySQL InnoDB集群中有3个节点 – M1 / M2 / M3,运行在单主模式(single primary mode)
  • MySQL Router配置为启用3306端口上的R/W连接和3307端口上RO连接
  • M1为当前的主节点 (处于读写模式)
  • M2和M3为从节点 (即只读模式)
  • M1失败! 有部分表损坏并且无法修复
  • M2和M3为新的临时集群

目标是重建M1并将其加入到集群中

1.3个处于运行中的MySQL 8.0.16 InnoDB集群

 1$ mysqlsh clusterAdmin@{mysqlRouterIP}:3306 --cluster
 2...
 3
 4MySQL JS> cluster.status()
 5{
 6    "clusterName": "pocCluster", 
 7    "defaultReplicaSet": {
 8        "name": "default", 
 9        "primary": "M1:3306", 
10        "ssl": "REQUIRED", 
11        "status": "OK", 
12        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
13        "topology": {
14            "M1:3306": {
15                "address": "M1:3306", 
16                "mode": "R/W", 
17                "readReplicas": {}, 
18                "role": "HA", 
19                "status": "ONLINE", 
20                "version": "8.0.16"
21            }, 
22            "M2:3306": {
23                "address": "M2:3306", 
24                "mode": "R/O", 
25                "readReplicas": {}, 
26                "role": "HA", 
27                "status": "ONLINE", 
28                "version": "8.0.16"
29            }, 
30            "M3:3306": {
31                "address": "M3:3306", 
32                "mode": "R/O", 
33                "readReplicas": {}, 
34                "role": "HA", 
35                "status": "ONLINE", 
36                "version": "8.0.16"
37            }
38        }, 
39        "topologyMode": "Single-Primary"
40    }, 
41    "groupInformationSourceMember": "M1:3306"
42}

M1失败了 ( “MISSING”状态) :

 1MySQL JS> cluster.status()
 2{
 3    "clusterName": "pocCluster", 
 4    "defaultReplicaSet": {
 5        "name": "default", 
 6        "primary": "M2:3306", 
 7        "ssl": "REQUIRED", 
 8        "status": "OK_NO_TOLERANCE", 
 9        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active", 
10        "topology": {
11            "M1:3306": {
12                "address": "M1:3306", 
13                "mode": "n/a", 
14                "readReplicas": {}, 
15                "role": "HA", 
16                "status": "(MISSING)"
17            }, 
18            "M2:3306": {
19                "address": "M2:3306", 
20                "mode": "R/W", 
21                "readReplicas": {}, 
22                "role": "HA", 
23                "status": "ONLINE", 
24                "version": "8.0.16"
25            }, 
26            "M3:3306": {
27                "address": "M3:3306", 
28                "mode": "R/O", 
29                "readReplicas": {}, 
30                "role": "HA", 
31                "status": "ONLINE", 
32                "version": "8.0.16"
33            }
34        }, 
35        "topologyMode": "Single-Primary"
36    }, 
37    "groupInformationSourceMember": "M2:3306"
38}

M1为主节点. 集群启动了自动数据库故障转移并选择出新的主节点

不久之后,M1最终被修复,并再次成为集群的一部分。 为了尽量减少恢复时间,而不是使用上一个转储,我们希望重新进行一次。

让我们在使用自定义配置的3307端口上使用MySQL Router在从节点上进行新的转储 (我们也可以使用主节点)。

1mysqldump --defaults-file=/etc/my.cnf -u mdump -p -h {mysqlRouterIP} -P 3307 \
2    --all-databases --routines --events --single-transaction --flush-privileges \
3    --hex-blob --log-error=/var/log/mysqldump.log --result-file=/data/backups/dump.sql

备注:

需要授予mysqldump用户连接MySQL的最低权限,通常最低权限取决于您要转储的对象。

对于大多数典型的用法中,通常授予以下权限:

GRANT SELECT, SHOW VIEW, EVENT, TRIGGER, LOCK TABLES, CREATE, ALTER, RELOAD, REPLICATION CLIENT, REPLICATION_SLAVE_ADMIN ON . TO

my.cnf之外,强烈建议在备份过程中同时备份所有节点的auto.cnfmysqld-auto.cnf

如果不小心丢失了auto.cnf文件,不用担心,服务器会生成一个新的文件,但是恢复过程将略有不同。

备份完成之后,我们准备在M1上还原数据了。

由于此服务器是MySQL InnoDB集群的一部分,因此与独立服务器还原相比,还需要一些额外的步骤。

1.还原数据

首先, 还原数据到M1:

  • 由于是逻辑还原,因此还原服务器必须已启动并运行
  • 停止组复制插件
    • STOP GROUP_REPLICATION;
  • 禁用记录bin log日志
    • SET SQL_LOG_BIN=0;
  • 删除bin log日志文件
    • RESET MASTER;
  • 清除Master信息和中继日志存储库,并删除所有中继日志(relay log)文件
    • RESET SLAVE;
  • 启用更新
    • SET GLOBAL super_read_only=0;
  • 执行还原
    • source /data/backups/dump.sql
 1mysqlsh root@M1:3306 --sql
 2...
 3
 4M1 SQL> STOP GROUP_REPLICATION;
 5Query OK, 0 rows affected (12.04 sec)
 6
 7M1 SQL> SET SQL_LOG_BIN=0;
 8Query OK, 0 rows affected (0.00 sec)
 9
10M1 SQL> RESET MASTER;
11Query OK, 0 rows affected (0.06 sec)
12
13M1 SQL> RESET SLAVE;
14Query OK, 0 rows affected (0.13 sec)
15
16M1 SQL> SET GLOBAL super_read_only=0;
17Query OK, 0 rows affected (0.00 sec)
18
19M1 SQL> source /data/backups/dump.sql
20Query OK, 0 rows affected (0.00 sec)
21...

2.将节点放回集群

其次, 将节点重新放回集群 连接到主节点上的MySQL Router (我这边是3306端口):

 1$ mysqlsh clusterAdmin@{mysqlRouterIP}:3306 --cluster
 2...
 3
 4MySQL JS> cluster.rejoinInstance("clusterAdmin@M1:3306")
 5Rejoining the instance to the InnoDB cluster. Depending on the original problem that made the instance unavailable, the rejoin operation might not be successful and further manual steps will be needed to fix the underlying problem.
 6
 7Please monitor the output of the rejoin operation and take necessary action if the instance cannot rejoin.
 8
 9Rejoining instance to the cluster ...
10
11The instance 'M1:3306' was successfully rejoined on the cluster.
12

再次检查集群的状态信息

 1MySQL JS> cluster.status()
 2{
 3    "clusterName": "pocCluster",
 4    "defaultReplicaSet": {
 5        "name": "default",
 6        "primary": "M2:3306",
 7        "ssl": "REQUIRED",
 8        "status": "OK",
 9        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
10        "topology": {
11            "M1:3306": {
12                "address": "M1:3306",
13                "mode": "R/O",
14                "readReplicas": {},
15                "role": "HA",
16                "status": "ONLINE",
17                "version": "8.0.16"
18            }, 
19            "M2:3306": {
20                "address": "M2:3306",
21                "mode": "R/W",
22                "readReplicas": {},
23                "role": "HA",
24                "status": "ONLINE",
25                "version": "8.0.16"
26            },
27            "M3:3306": {
28                "address": "M3:3306",
29                "mode": "R/O",
30                "readReplicas": {},
31                "role": "HA",
32                "status": "ONLINE",
33                "version": "8.0.16"
34            }
35        },
36        "topologyMode": "Single-Primary"
37    },
38    "groupInformationSourceMember": "M2:3306"
39}

注意:

恢复节点的集群状态将在变为“ONLINE”之前处于“RECOVERING”状态。

3.丢失auto.cnf文件的还原

如之前所讲,不会还原 auto.cnf 配置文件。 实际上,在这种情况下,集群将该节点视为新的节点 (因为服务器将生成新的UUID)。 因此,重新添加到集群的过程是不同的。

另外,还需要注意的是,如果遗失 mysqld-auto.cnf 文件,则可能需要将服务器(再次)配置为支持组复制f。

因此,基本上,该过程需要进行一些清理,然后添加旧节点,就像添加节点一样。

假设组复制插件在M1停止:

  1// Check the configuration even more important if you lost mysqld-auto.cnf :)
  2MySQL JS> dba.checkInstanceConfiguration('clusterAdmin@M1:3306')
  3Validating MySQL instance at M1:3306 for use in an InnoDB cluster...
  4
  5This instance reports its own address as M1
  6
  7Checking whether existing tables comply with Group Replication requirements...
  8No incompatible tables detected
  9
 10Checking instance configuration...
 11Instance configuration is compatible with InnoDB cluster
 12
 13The instance 'M1:3306' is valid for InnoDB cluster usage.
 14
 15{
 16    "status": "ok"
 17}
 18
 19
 20// If needed configure your instance
 21MySQL JS> dba.configureInstance('clusterAdmin@M1:3306')
 22
 23
 24// Remove the old node from the cluster metadata
 25MySQL JS> cluster.rescan()
 26Rescanning the cluster...
 27
 28Result of the rescanning operation for the 'default' ReplicaSet:
 29{
 30    "name": "default",
 31    "newTopologyMode": null,
 32    "newlyDiscoveredInstances": [],
 33    "unavailableInstances": [
 34        {
 35            "host": "M1:3306",
 36            "label": "M1:3306",
 37            "member_id": "a3f1ee50-9be3-11e9-a3fe-0242ac13000b"
 38        }
 39    ]
 40}
 41
 42The instance 'M1:3306' is no longer part of the ReplicaSet.
 43The instance is either offline or left the HA group. You can try to add it to the cluster again with the cluster.rejoinInstance('M1:3306') command or you can remove it from the cluster configuration.
 44Would you like to remove it from the cluster metadata? [Y/n]: Y
 45Removing instance from the cluster metadata...
 46The instance 'M1:3306' was successfully removed from the cluster metadata.
 47
 48
 49// Add the new instance
 50MySQL JS> cluster.addInstance("clusterAdmin@M1:3306")
 51A new instance will be added to the InnoDB cluster. Depending on the amount of data on the cluster this might take from a few seconds to several hours.
 52
 53Adding instance to the cluster ...
 54
 55Validating instance at M1:3306...
 56
 57This instance reports its own address as M1
 58
 59Instance configuration is suitable.
 60The instance 'clusterAdmin@M1:3306' was successfully added to the cluster.
 61
 62
 63// Check
 64MySQL JS> cluster.status()
 65{
 66    "clusterName": "pocCluster",
 67    "defaultReplicaSet": {
 68        "name": "default",
 69        "primary": "M2:3306",
 70        "ssl": "REQUIRED",
 71        "status": "OK",
 72        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
 73        "topology": {
 74            "M1:3306": {
 75                "address": "M1:3306",
 76                "mode": "R/O",
 77                "readReplicas": {},
 78                "role": "HA",
 79                "status": "ONLINE",
 80                "version": "8.0.16"
 81            },
 82            "M2:3306": {
 83                "address": "M2:3306",
 84                "mode": "R/W",
 85                "readReplicas": {},
 86                "role": "HA",
 87                "status": "ONLINE",
 88                "version": "8.0.16"
 89            }, 
 90            "M3:3306": {
 91                "address": "M3:3306", 
 92                "mode": "R/O", 
 93                "readReplicas": {}, 
 94                "role": "HA", 
 95                "status": "ONLINE", 
 96                "version": "8.0.16"
 97            }
 98        }, 
 99        "topologyMode": "Single-Primary"
100    }, 
101    "groupInformationSourceMember": "M2:3306"
102}

等等!

节点配置

  • 一个3节点的MySQL InnoDB集群 – M1 / M2 / M3处于单主模式
  • MySQL Router配置为启用3306端口上的R/W连接和3307端口上的RO连接
  • M2为当前读写模式下的主节点
  • M1和M3当前为只读模式下的从节点

目标:添加2个新节点到集群中: M4和M5

我们在第一部分中使用了3个节点的MySQL 8.0.16 InnoDB集群,目前已经启动并在运行。

实际上,添加新节点与我们之前所做的非常类似。

具体过程如下:

  • 部署新的MySQL实例,最好已经配置为组复制
  • 以前面的方式还原数据到新的MySQL实例

可以分别使用dba.checkInstanceConfiguration()dba.configure()函数来检查配置和配置本身(使用checkInstanceState()也可能很有用)。

例如在节点M4上进行操作:

 1$ mysqlsh clusterAdmin@M4:3306 -- dba checkInstanceConfiguration
 2Validating MySQL instance at M4:3306 for use in an InnoDB cluster...
 3
 4This instance reports its own address as M4
 5
 6Checking whether existing tables comply with Group Replication requirements...
 7No incompatible tables detected
 8
 9Checking instance configuration...
10Instance configuration is compatible with InnoDB cluster
11
12The instance 'M4:3306' is valid for InnoDB cluster usage.
13
14{
15    "status": "ok"
16}

1.还原数据

还原过程的第一部分与我们前面看到的相同

 1$ mysqlsh root@M4:3306 --sql
 2...
 3
 4M4 SQL> STOP GROUP_REPLICATION; -- if necessary
 5Query OK, 0 rows affected (12.04 sec)
 6
 7M4 SQL> SET SQL_LOG_BIN=0;
 8Query OK, 0 rows affected (0.00 sec)
 9
10M4 SQL> RESET MASTER; 
11Query OK, 0 rows affected (0.06 sec)
12
13M4 SQL> RESET SLAVE;
14Query OK, 0 rows affected (0.13 sec)
15
16M4 SQL> SET GLOBAL super_read_only=0;
17Query OK, 0 rows affected (0.00 sec)
18
19M4 SQL> source /data/backups/dump.sql
20Query OK, 0 rows affected (0.00 sec)
21...

2.将节点加入到集群中

第二部分,我们添加新的节点(M4) :

 1// Add the new instance
 2MySQL JS> cluster.addInstance("clusterAdmin@M4:3306")
 3
 4
 5// Check
 6MySQL JS> cluster.status()
 7{
 8    "clusterName": "pocCluster",
 9    "defaultReplicaSet": {
10        "name": "default",
11        "primary": "M2:3306",
12        "ssl": "REQUIRED",
13        "status": "OK",
14        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
15        "topology": {
16            "M1:3306": {
17                "address": "M1:3306",
18                "mode": "R/O",
19                "readReplicas": {},
20                "role": "HA",
21                "status": "ONLINE",
22                "version": "8.0.16"
23            },
24            "M2:3306": {
25                "address": "M2:3306",
26                "mode": "R/W",
27                "readReplicas": {},
28                "role": "HA",
29                "status": "ONLINE",
30                "version": "8.0.16"
31            },
32            "M3:3306": {
33                "address": "M3:3306",
34                "mode": "R/O",
35                "readReplicas": {},
36                "role": "HA",
37                "status": "ONLINE",
38                "version": "8.0.16"
39            },
40            "M4:3306": {
41                "address": "M4:3306",
42                "mode": "R/O",
43                "readReplicas": {},
44                "role": "HA",
45                "status": "ONLINE",
46                "version": "8.0.16"
47            }
48        },
49        "topologyMode": "Single-Primary"
50    },
51    "groupInformationSourceMember": "M2:3306"
52}

注意:

如果有必要,可以在addInstance()执行之前执行checkInstanceConfiguration()configureInstance()

最后一个节点M5的处理方式相同。

最后,您将看到5个节点的MySQL InnoDB集群:

 1MySQL JS> cluster.status()
 2{
 3    "clusterName": "pocCluster", 
 4    "defaultReplicaSet": {
 5        "name": "default", 
 6        "primary": "M2:3306", 
 7        "ssl": "REQUIRED", 
 8        "status": "OK", 
 9        "statusText": "Cluster is ONLINE and can tolerate up to 2 failures.", 
10        "topology": {
11            "M1:3306": {
12                "address": "M1:3306", 
13                "mode": "R/O", 
14                "readReplicas": {}, 
15                "role": "HA", 
16                "status": "ONLINE", 
17                "version": "8.0.16"
18            }, 
19            "M2:3306": {
20                "address": "M2:3306", 
21                "mode": "R/W", 
22                "readReplicas": {}, 
23                "role": "HA", 
24                "status": "ONLINE", 
25                "version": "8.0.16"
26            }, 
27            "M3:3306": {
28                "address": "M3:3306", 
29                "mode": "R/O", 
30                "readReplicas": {}, 
31                "role": "HA", 
32                "status": "ONLINE", 
33                "version": "8.0.16"
34            }, 
35            "M4:3306": {
36                "address": "M4:3306", 
37                "mode": "R/O", 
38                "readReplicas": {}, 
39                "role": "HA", 
40                "status": "ONLINE", 
41                "version": "8.0.16"
42            }, 
43            "M5:3306": {
44                "address": "M5:3306", 
45                "mode": "R/O", 
46                "readReplicas": {}, 
47                "role": "HA", 
48                "status": "ONLINE", 
49                "version": "8.0.16"
50            }
51        }, 
52        "topologyMode": "Single-Primary"
53    },
54    "groupInformationSourceMember": "M2:3306"
55}

总结

当您的数据量不大时,使用mysqldump进行节点恢复和配置的一种方法。

当逻辑备份效率不高时,就该使用热备、在线无阻塞的物理备份工具,如 MySQL Enterprise Backup

参考资料

其他

Node 1 – 组复制配置

 1mysql> SHOW VARIABLES LIKE 'group_replication%';
 2+-----------------------------------------------------+-------------------------------------+
 3| Variable_name                                       | Value                               |
 4+-----------------------------------------------------+-------------------------------------+
 5| group_replication_allow_local_lower_version_join    | OFF                                 |
 6| group_replication_auto_increment_increment          | 7                                   |
 7| group_replication_autorejoin_tries                  | 1                                   |
 8| group_replication_bootstrap_group                   | OFF                                 |
 9| group_replication_communication_debug_options       | GCS_DEBUG_NONE                      |
10| group_replication_communication_max_message_size    | 10485760                            |
11| group_replication_components_stop_timeout           | 31536000                            |
12| group_replication_compression_threshold             | 1000000                             |
13| group_replication_consistency                       | EVENTUAL                            |
14| group_replication_enforce_update_everywhere_checks  | OFF                                 |
15| group_replication_exit_state_action                 | READ_ONLY                           |
16| group_replication_flow_control_applier_threshold    | 25000                               |
17| group_replication_flow_control_certifier_threshold  | 25000                               |
18| group_replication_flow_control_hold_percent         | 10                                  |
19| group_replication_flow_control_max_quota            | 0                                   |
20| group_replication_flow_control_member_quota_percent | 0                                   |
21| group_replication_flow_control_min_quota            | 0                                   |
22| group_replication_flow_control_min_recovery_quota   | 0                                   |
23| group_replication_flow_control_mode                 | QUOTA                               |
24| group_replication_flow_control_period               | 1                                   |
25| group_replication_flow_control_release_percent      | 50                                  |
26| group_replication_force_members                     |                                     |
27| group_replication_group_name                        | d1b109bf-9be3-11e9-9ea2-0242ac13000b|
28| group_replication_group_seeds                       | M2:33061,M3:33061,M4:33061,M5:33061 |
29| group_replication_gtid_assignment_block_size        | 1000000                             |
30| group_replication_ip_whitelist                      | AUTOMATIC                           |
31| group_replication_local_address                     | M1:33061                            |
32| group_replication_member_expel_timeout              | 0                                   |
33| group_replication_member_weight                     | 50                                  |
34| group_replication_message_cache_size                | 1073741824                          |
35| group_replication_poll_spin_loops                   | 0                                   |
36| group_replication_recovery_complete_at              | TRANSACTIONS_APPLIED                |
37| group_replication_recovery_get_public_key           | OFF                                 |
38| group_replication_recovery_public_key_path          |                                     |
39| group_replication_recovery_reconnect_interval       | 60                                  |
40| group_replication_recovery_retry_count              | 10                                  |
41| group_replication_recovery_ssl_ca                   |                                     |
42| group_replication_recovery_ssl_capath               |                                     |
43| group_replication_recovery_ssl_cert                 |                                     |
44| group_replication_recovery_ssl_cipher               |                                     |
45| group_replication_recovery_ssl_crl                  |                                     |
46| group_replication_recovery_ssl_crlpath              |                                     |
47| group_replication_recovery_ssl_key                  |                                     |
48| group_replication_recovery_ssl_verify_server_cert   | OFF                                 |
49| group_replication_recovery_use_ssl                  | ON                                  |
50| group_replication_single_primary_mode               | ON                                  |
51| group_replication_ssl_mode                          | REQUIRED                            |
52| group_replication_start_on_boot                     | ON                                  |
53| group_replication_transaction_size_limit            | 150000000                           |
54| group_replication_unreachable_majority_timeout      | 0                                   |
55+-----------------------------------------------------+-------------------------------------+