Orchestrator Internals: How Failure Types Are Determined

Orchestrator failure types

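The official documentation shows that orchestrator recognizes many failure types, such as DeadMaster, DeadMasterAndReplicas, DeadMasterAndSomeReplicas, and so on. So how does it decide which one applies? Today we take a look inside orchestrator to find out.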

orchestrator's data collection process

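To understand how orchestrator determines a failure type, you first need to know how it probes (discovers) instances. That topic will be covered in a separate article; this one focuses on how failure types are determined.
Roughly speaking, orchestrator probes each instance by calling the DiscoverInstance() function, which lives in orchestrator.go.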

Failure types

The most common failure types are DeadMaster, DeadMasterAndReplicas, and their close relatives. The code analysis follows.

Code analysis

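The decision code lives in inst/analysis_dao.go, starting at roughly line 491.
Below is the single SQL statement that fetches the instance state; it is very complex: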

SELECT master_instance.hostname, master_instance.port, master_instance.read_only AS read_only, MIN(master_instance.data_center) AS data_center, MIN(master_instance.region) AS region, MIN(master_instance.physical_environment) AS physical_environment, MIN(master_instance.master_host) AS master_host, MIN(master_instance.master_port) AS master_port, MIN(master_instance.cluster_name) AS cluster_name, MIN(IFNULL(cluster_alias.alias, master_instance.cluster_name)) AS cluster_alias, MIN( master_instance.last_checked <= master_instance.last_seen and master_instance.last_attempted_check <= master_instance.last_seen + interval 6 second ) = 1 AS is_last_check_valid, MIN(master_instance.last_check_partial_success) as last_check_partial_success, MIN(master_instance.master_host IN ('' , '_') OR master_instance.master_port = 0 OR substr(master_instance.master_host, 1, 2) = '//') AS is_master, MIN(master_instance.is_co_master) AS is_co_master, MIN(CONCAT(master_instance.hostname, ':', master_instance.port) = master_instance.cluster_name) AS is_cluster_master, MIN(master_instance.gtid_mode) AS gtid_mode, COUNT(replica_instance.server_id) AS count_replicas, IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen), 0) AS count_valid_slaves, IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen AND replica_instance.slave_io_running != 0 AND replica_instance.slave_sql_running != 0), 0) AS count_valid_replicating_slaves, IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen AND replica_instance.slave_io_running = 0 AND replica_instance.last_io_error like '%error %connecting to master%' AND replica_instance.slave_sql_running = 1), 0) AS count_replicas_failing_to_connect_to_master, MIN(master_instance.replication_depth) AS replication_depth, GROUP_CONCAT(concat(replica_instance.Hostname, ':', replica_instance.Port)) as slave_hosts, MIN( master_instance.slave_sql_running = 1 AND master_instance.slave_io_running = 0 AND 
master_instance.last_io_error like '%error %connecting to master%' ) AS is_failing_to_connect_to_master, MIN( master_downtime.downtime_active is not null and ifnull(master_downtime.end_timestamp, now()) > now() ) AS is_downtimed, MIN( IFNULL(master_downtime.end_timestamp, '') ) AS downtime_end_timestamp, MIN( IFNULL(unix_timestamp() - unix_timestamp(master_downtime.end_timestamp), 0) ) AS downtime_remaining_seconds, MIN( master_instance.binlog_server ) AS is_binlog_server, MIN( master_instance.pseudo_gtid ) AS is_pseudo_gtid, MIN( master_instance.supports_oracle_gtid ) AS supports_oracle_gtid, SUM( replica_instance.oracle_gtid ) AS count_oracle_gtid_slaves, IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen AND replica_instance.oracle_gtid != 0), 0) AS count_valid_oracle_gtid_slaves, SUM( replica_instance.binlog_server ) AS count_binlog_server_slaves, IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen AND replica_instance.binlog_server != 0), 0) AS count_valid_binlog_server_slaves, MIN( master_instance.mariadb_gtid ) AS is_mariadb_gtid, SUM( replica_instance.mariadb_gtid ) AS count_mariadb_gtid_slaves, IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen AND replica_instance.mariadb_gtid != 0), 0) AS count_valid_mariadb_gtid_slaves, IFNULL(SUM(replica_instance.log_bin AND replica_instance.log_slave_updates), 0) AS count_logging_replicas, IFNULL(SUM(replica_instance.log_bin AND replica_instance.log_slave_updates AND replica_instance.binlog_format = 'STATEMENT'), 0) AS count_statement_based_loggin_slaves, IFNULL(SUM(replica_instance.log_bin AND replica_instance.log_slave_updates AND replica_instance.binlog_format = 'MIXED'), 0) AS count_mixed_based_loggin_slaves, IFNULL(SUM(replica_instance.log_bin AND replica_instance.log_slave_updates AND replica_instance.binlog_format = 'ROW'), 0) AS count_row_based_loggin_slaves, IFNULL(SUM(replica_instance.sql_delay > 0), 0) AS count_delayed_replicas, 
IFNULL(SUM(replica_instance.slave_lag_seconds > 10), 0) AS count_lagging_replicas, IFNULL(MIN(replica_instance.gtid_mode), '') AS min_replica_gtid_mode, IFNULL(MAX(replica_instance.gtid_mode), '') AS max_replica_gtid_mode, IFNULL(MAX( case when replica_downtime.downtime_active is not null and ifnull(replica_downtime.end_timestamp, now()) > now() then '' else replica_instance.gtid_errant end ), '') AS max_replica_gtid_errant, IFNULL(SUM( replica_downtime.downtime_active is not null and ifnull(replica_downtime.end_timestamp, now()) > now()), 0) AS count_downtimed_replicas, COUNT(DISTINCT case when replica_instance.log_bin AND replica_instance.log_slave_updates then replica_instance.major_version else NULL end ) AS count_distinct_logging_major_versions FROM database_instance master_instance LEFT JOIN hostname_resolve ON (master_instance.hostname = hostname_resolve.hostname) LEFT JOIN database_instance replica_instance ON (COALESCE(hostname_resolve.resolved_hostname, master_instance.hostname) = replica_instance.master_host AND master_instance.port = replica_instance.master_port) LEFT JOIN database_instance_maintenance ON (master_instance.hostname = database_instance_maintenance.hostname AND master_instance.port = database_instance_maintenance.port AND database_instance_maintenance.maintenance_active = 1) LEFT JOIN database_instance_downtime as master_downtime ON (master_instance.hostname = master_downtime.hostname AND master_instance.port = master_downtime.port AND master_downtime.downtime_active = 1) LEFT JOIN database_instance_downtime as replica_downtime ON (replica_instance.hostname = replica_downtime.hostname AND replica_instance.port = replica_downtime.port AND replica_downtime.downtime_active = 1) LEFT JOIN cluster_alias ON (cluster_alias.cluster_name = master_instance.cluster_name) WHERE database_instance_maintenance.database_instance_maintenance_id IS NULL AND '' IN ('', master_instance.cluster_name) GROUP BY master_instance.hostname, master_instance.port 
HAVING (MIN( master_instance.last_checked <= master_instance.last_seen and master_instance.last_attempted_check <= master_instance.last_seen + interval 6 second ) = 1 ) = 0 OR (IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen AND replica_instance.slave_io_running = 0 AND replica_instance.last_io_error like '%error %connecting to master%' AND replica_instance.slave_sql_running = 1), 0) > 0) OR (IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen), 0) < COUNT(replica_instance.server_id) ) OR (IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen AND replica_instance.slave_io_running != 0 AND replica_instance.slave_sql_running != 0), 0) < COUNT(replica_instance.server_id) ) OR (MIN( master_instance.slave_sql_running = 1 AND master_instance.slave_io_running = 0 AND master_instance.last_io_error like '%error %connecting to master%' ) ) OR (COUNT(replica_instance.server_id) > 0) ORDER BY is_master DESC , is_cluster_master DESC, count_replicas DESC

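The most important fields and the SQL expressions that produce them are shown below. All of them are derived from the database_instance base table: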

IsMaster (whether the instance is a master):
MIN(
			(
				master_instance.master_host IN ('', '_')
				OR master_instance.master_port = 0
				OR substr(master_instance.master_host, 1, 2) = '//'
			)
			AND (
				master_instance.replication_group_name = ''
				OR master_instance.replication_group_member_role = 'PRIMARY'
			)
		) AS is_master

LastCheckValid (whether the most recent check succeeded; false means the last check failed):
MIN(
			master_instance.last_checked <= master_instance.last_seen
			and master_instance.last_attempted_check <= master_instance.last_seen + interval ? second
		) = 1 AS is_last_check_valid,
		
CountValidReplicas (number of replicas whose last check succeeded, i.e. reachable replicas):
IFNULL(
			SUM(
				replica_instance.last_checked <= replica_instance.last_seen
			),
			0
		) AS count_valid_replicas,
		
CountReplicas (total number of replicas):
COUNT(replica_instance.server_id) AS count_replicas,

CountValidReplicatingReplicas (number of reachable replicas whose IO and SQL threads are both running):
IFNULL(
			SUM(
				replica_instance.last_checked <= replica_instance.last_seen
				AND replica_instance.slave_io_running != 0
				AND replica_instance.slave_sql_running != 0
			),
			0
		) AS count_valid_replicating_replicas,

Below is the actual decision code:

// A failed check was detected here, so the relevant details are logged
if !a.LastCheckValid {
			analysisMessage := fmt.Sprintf("analysis: ClusterName: %+v, IsMaster: %+v, LastCheckValid: %+v, LastCheckPartialSuccess: %+v, CountReplicas: %+v, CountValidReplicas: %+v, CountValidReplicatingReplicas: %+v, CountLaggingReplicas: %+v, CountDelayedReplicas: %+v, CountReplicasFailingToConnectToMaster: %+v",
				a.ClusterDetails.ClusterName, a.IsMaster, a.LastCheckValid, a.LastCheckPartialSuccess, a.CountReplicas, a.CountValidReplicas, a.CountValidReplicatingReplicas, a.CountLaggingReplicas, a.CountDelayedReplicas, a.CountReplicasFailingToConnectToMaster,
			)
			if util.ClearToLog("analysis_dao", analysisMessage) {
				log.Debugf(analysisMessage)
			}
		}
		if !a.IsReplicationGroupMember /* Traditional Async/Semi-sync replication issue detection */ {
			if a.IsMaster && !a.LastCheckValid && a.CountReplicas == 0 {
				a.Analysis = DeadMasterWithoutReplicas
				a.Description = "Master cannot be reached by orchestrator and has no replica"
				//
			} else if a.IsMaster && !a.LastCheckValid && a.CountValidReplicas == a.CountReplicas && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = DeadMaster
				a.Description = "Master cannot be reached by orchestrator and none of its replicas is replicating"
				//
			} else if a.IsMaster && !a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicas == 0 && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = DeadMasterAndReplicas
				a.Description = "Master cannot be reached by orchestrator and none of its replicas is replicating"
				//
			} else if a.IsMaster && !a.LastCheckValid && a.CountValidReplicas < a.CountReplicas && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = DeadMasterAndSomeReplicas
				a.Description = "Master cannot be reached by orchestrator; some of its replicas are unreachable and none of its reachable replicas is replicating"
				//
			} else if a.IsMaster && !a.LastCheckValid && a.CountLaggingReplicas == a.CountReplicas && a.CountDelayedReplicas < a.CountReplicas && a.CountValidReplicatingReplicas > 0 {
				a.Analysis = UnreachableMasterWithLaggingReplicas
				a.Description = "Master cannot be reached by orchestrator and all of its replicas are lagging"
				//
			} else if a.IsMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0 {
				// partial success is here to reduce noise
				a.Analysis = UnreachableMaster
				a.Description = "Master cannot be reached by orchestrator but it has replicating replicas; possibly a network/host issue"
				//
			} else if a.IsMaster && !a.LastCheckValid && a.LastCheckPartialSuccess && a.CountReplicasFailingToConnectToMaster > 0 && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0 {
				// there's partial success, but also at least one replica is failing to connect to master
				a.Analysis = UnreachableMaster
				a.Description = "Master cannot be reached by orchestrator but it has replicating replicas; possibly a network/host issue"
				//
			} else if a.IsMaster && a.SemiSyncMasterEnabled && a.SemiSyncMasterStatus && a.SemiSyncMasterWaitForReplicaCount > 0 && a.SemiSyncMasterClients < a.SemiSyncMasterWaitForReplicaCount {
				if isStaleBinlogCoordinates {
					a.Analysis = LockedSemiSyncMaster
					a.Description = "Semi sync master is locked since it doesn't get enough replica acknowledgements"
				} else {
					a.Analysis = LockedSemiSyncMasterHypothesis
					a.Description = "Semi sync master seems to be locked, more samplings needed to validate"
				}
				//
			} else if a.IsMaster && a.LastCheckValid && a.CountReplicas == 1 && a.CountValidReplicas == a.CountReplicas && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = MasterSingleReplicaNotReplicating
				a.Description = "Master is reachable but its single replica is not replicating"
				//
			} else if a.IsMaster && a.LastCheckValid && a.CountReplicas == 1 && a.CountValidReplicas == 0 {
				a.Analysis = MasterSingleReplicaDead
				a.Description = "Master is reachable but its single replica is dead"
				//
			} else if a.IsMaster && a.LastCheckValid && a.CountReplicas > 1 && a.CountValidReplicas == a.CountReplicas && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = AllMasterReplicasNotReplicating
				a.Description = "Master is reachable but none of its replicas is replicating"
				//
			} else if a.IsMaster && a.LastCheckValid && a.CountReplicas > 1 && a.CountValidReplicas < a.CountReplicas && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = AllMasterReplicasNotReplicatingOrDead
				a.Description = "Master is reachable but none of its replicas is replicating"
				//
			} else /* co-master */ if a.IsCoMaster && !a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicas == a.CountReplicas && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = DeadCoMaster
				a.Description = "Co-master cannot be reached by orchestrator and none of its replicas is replicating"
				//
			} else if a.IsCoMaster && !a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicas < a.CountReplicas && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = DeadCoMasterAndSomeReplicas
				a.Description = "Co-master cannot be reached by orchestrator; some of its replicas are unreachable and none of its reachable replicas is replicating"
				//
			} else if a.IsCoMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0 {
				a.Analysis = UnreachableCoMaster
				a.Description = "Co-master cannot be reached by orchestrator but it has replicating replicas; possibly a network/host issue"
				//
			} else if a.IsCoMaster && a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = AllCoMasterReplicasNotReplicating
				a.Description = "Co-master is reachable but none of its replicas is replicating"
				//
			} else /* intermediate-master */ if !a.IsMaster && !a.LastCheckValid && a.CountReplicas == 1 && a.CountValidReplicas == a.CountReplicas && a.CountReplicasFailingToConnectToMaster == a.CountReplicas && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = DeadIntermediateMasterWithSingleReplicaFailingToConnect
				a.Description = "Intermediate master cannot be reached by orchestrator and its (single) replica is failing to connect"
				//
			} else if !a.IsMaster && !a.LastCheckValid && a.CountReplicas == 1 && a.CountValidReplicas == a.CountReplicas && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = DeadIntermediateMasterWithSingleReplica
				a.Description = "Intermediate master cannot be reached by orchestrator and its (single) replica is not replicating"
				//
			} else if !a.IsMaster && !a.LastCheckValid && a.CountReplicas > 1 && a.CountValidReplicas == a.CountReplicas && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = DeadIntermediateMaster
				a.Description = "Intermediate master cannot be reached by orchestrator and none of its replicas is replicating"
				//
			} else if !a.IsMaster && !a.LastCheckValid && a.CountValidReplicas < a.CountReplicas && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = DeadIntermediateMasterAndSomeReplicas
				a.Description = "Intermediate master cannot be reached by orchestrator; some of its replicas are unreachable and none of its reachable replicas is replicating"
				//
			} else if !a.IsMaster && !a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicas == 0 {
				a.Analysis = DeadIntermediateMasterAndReplicas
				a.Description = "Intermediate master cannot be reached by orchestrator and all of its replicas are unreachable"
				//
			} else if !a.IsMaster && !a.LastCheckValid && a.CountLaggingReplicas == a.CountReplicas && a.CountDelayedReplicas < a.CountReplicas && a.CountValidReplicatingReplicas > 0 {
				a.Analysis = UnreachableIntermediateMasterWithLaggingReplicas
				a.Description = "Intermediate master cannot be reached by orchestrator and all of its replicas are lagging"
				//
			} else if !a.IsMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0 {
				a.Analysis = UnreachableIntermediateMaster
				a.Description = "Intermediate master cannot be reached by orchestrator but it has replicating replicas; possibly a network/host issue"
				//
			} else if !a.IsMaster && a.LastCheckValid && a.CountReplicas > 1 && a.CountValidReplicatingReplicas == 0 &&
				a.CountReplicasFailingToConnectToMaster > 0 && a.CountReplicasFailingToConnectToMaster == a.CountValidReplicas {
				// All replicas are either failing to connect to master (and at least one of these have to exist)
				// or completely dead.
				// Must have at least two replicas to reach such conclusion -- do note that the intermediate master is still
				// reachable to orchestrator, so we base our conclusion on replicas only at this point.
				a.Analysis = AllIntermediateMasterReplicasFailingToConnectOrDead
				a.Description = "Intermediate master is reachable but all of its replicas are failing to connect"
				//
			} else if !a.IsMaster && a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = AllIntermediateMasterReplicasNotReplicating
				a.Description = "Intermediate master is reachable but none of its replicas is replicating"
				//
			} else if a.IsBinlogServer && a.IsFailingToConnectToMaster {
				a.Analysis = BinlogServerFailingToConnectToMaster
				a.Description = "Binlog server is unable to connect to its master"
				//
			} else if a.ReplicationDepth == 1 && a.IsFailingToConnectToMaster {
				a.Analysis = FirstTierReplicaFailingToConnectToMaster
				a.Description = "1st tier replica (directly replicating from topology master) is unable to connect to the master"
				//
			}
			//		 else if a.IsMaster && a.CountReplicas == 0 {
			//			a.Analysis = MasterWithoutReplicas
			//			a.Description = "Master has no replicas"
			//		}

		} else /* Group replication issue detection */ {
			// Group member is not reachable, has replicas, and none of its reachable replicas can replicate from it
			if !a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicatingReplicas == 0 {
				a.Analysis = DeadReplicationGroupMemberWithReplicas
				a.Description = "Group member is unreachable and all its reachable replicas are not replicating"
			}

		}
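
The first four branches above, which cover the dead-master family, can be condensed into a small standalone function. The following Go sketch is illustrative only: the analysis struct is a hypothetical subset of the real fields, classifyDeadMaster is an invented name, and analysis codes are returned as plain strings. Every other branch (unreachable master, semi-sync, co-master, intermediate master, group replication) is deliberately left out and yields an empty string.

```go
package main

import "fmt"

// analysis is a hypothetical subset of the fields used by the decision code.
type analysis struct {
	IsMaster                      bool
	LastCheckValid                bool
	CountReplicas                 uint
	CountValidReplicas            uint
	CountValidReplicatingReplicas uint
}

// classifyDeadMaster reproduces only the four dead-master branches, in the
// same order as the original if/else chain. It returns "" for anything else.
func classifyDeadMaster(a analysis) string {
	if !a.IsMaster || a.LastCheckValid {
		return "" // not an unreachable master: outside this sketch
	}
	switch {
	case a.CountReplicas == 0:
		return "DeadMasterWithoutReplicas"
	case a.CountValidReplicas == a.CountReplicas && a.CountValidReplicatingReplicas == 0:
		// every replica is reachable, and none of them is replicating
		return "DeadMaster"
	case a.CountValidReplicas == 0 && a.CountValidReplicatingReplicas == 0:
		// no replica is reachable either
		return "DeadMasterAndReplicas"
	case a.CountValidReplicas < a.CountReplicas && a.CountValidReplicas > 0 &&
		a.CountValidReplicatingReplicas == 0:
		// some replicas are reachable, none of those is replicating
		return "DeadMasterAndSomeReplicas"
	}
	return ""
}

func main() {
	cases := []analysis{
		{IsMaster: true, CountReplicas: 0},
		{IsMaster: true, CountReplicas: 3, CountValidReplicas: 3},
		{IsMaster: true, CountReplicas: 3, CountValidReplicas: 0},
		{IsMaster: true, CountReplicas: 3, CountValidReplicas: 2},
		{IsMaster: true, CountReplicas: 3, CountValidReplicas: 3, CountValidReplicatingReplicas: 2},
	}
	for _, a := range cases {
		fmt.Printf("%q\n", classifyDeadMaster(a))
	}
}
```

The last case returns an empty string: the master is unreachable, but two replicas are still replicating, so none of the dead-master analyses applies and the original code would go on to consider the UnreachableMaster branches instead. In other words, orchestrator treats a master as dead only when the replicas it can reach agree that replication has stopped.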
