A database audit scheme for finding instances that are no longer in use

qq_35640866

Last modified 2024-01-16 16:49:44

Tags: database

First published 2023-12-19 17:01:08

Article link: https://blog.csdn.net/qq_35640866/article/details/134803079

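Background
It was unclear whether any production instances were no longer in use but had never been released.

Approach
Every two hours, call the Alibaba Cloud monitoring API to fetch the previous two hours' maximum read QPS, maximum write QPS, maximum connection count, and similar metrics. Then analyze the maxima over the last three days. These maxima drive a first-pass screening, which is followed by manual review. Traffic generated by monitoring itself has to be accounted for so that it does not distort the result.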
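
The collection window and the three-day aggregation described here could be sketched roughly as follows. This is a minimal illustration in plain Python, not the author's collector: the function names and the shape of the `samples` records are assumptions, and the actual API calls are left out.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def previous_window(now, hours=2):
    """Return (start, end) covering the `hours` before `now`."""
    return now - timedelta(hours=hours), now


def three_day_max(samples, now, days=3):
    """Reduce periodic samples to per-instance maxima over the last `days` days.

    `samples` is an iterable of (inst_id, collected_at, {metric: value});
    this record layout is hypothetical.
    """
    cutoff = now - timedelta(days=days)
    result = defaultdict(dict)
    for inst_id, collected_at, metrics in samples:
        if collected_at <= cutoff:
            continue  # older than the analysis window
        for name, value in metrics.items():
            previous = result[inst_id].get(name)
            result[inst_id][name] = value if previous is None else max(previous, value)
    return dict(result)
```

Taking the maximum rather than the average means a single burst of real traffic inside the window is enough to keep an instance off the candidate list.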

Result
The audit found databases that were no longer in use, saving the company 19,000 per month in fixed costs.

How each database product is judged:

RDS criteria
Metrics collected for RDS:

 ["CpuUsage", "IOPSUsage", "DiskUsage", "MySQL_ComDelete", "MySQL_ComInsert",
  "MySQL_ComInsertSelect", "MySQL_ComReplace","MySQL_ComSelect", "MySQL_ComUpdate", "DASMySQLSessionCount"]

DASMySQLSessionCount is obtained through the DAS service API:
das20200116_models.GetMySQLAllSessionAsyncRequest
The other metrics are obtained through the CloudMonitor (CMS) DescribeMetricDataRequest interface.

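An RDS instance is judged unused when its three-day maximum write QPS is below 1 and its maximum read QPS is below 6, or when its maximum number of long-lived sessions is 0. The thresholds are not zero because the instance's own monitoring produces a small amount of QPS even when nothing else uses it.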

where
(
a.max_write_max < 1
and a.max_select_max < 6
)
or a.max_das_total_session_count = 0
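
The same rule can be written as a small Python predicate, which makes the AND/OR grouping explicit. This is only a sketch: the function and parameter names are made up for illustration, and the inputs are assumed to be the three-day maxima already computed.

```python
def rds_looks_unused(max_write_qps, max_select_qps, max_session_count):
    """First-pass check for an RDS instance, using three-day maxima."""
    # Monitoring alone produces a little traffic, hence thresholds above zero.
    low_traffic = max_write_qps < 1 and max_select_qps < 6
    # No long-lived session at all during the window.
    no_sessions = max_session_count == 0
    return low_traffic or no_sessions
```

An instance flagged here is only a candidate; it still goes through manual review before anything is released.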

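Screening SQL (note that this query filters on write QPS, CPU, IOPS, and session count rather than on the read-QPS rule quoted above)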

select
  b.*
from
  (
    select
      inst_id,
      `inst_name`,
      max_write_max,
      max_select_max,
      max_das_total_session_count,
      max_cpu_max,
      max_iops_max
    from
      (
        SELECT
          inst_id,
          `inst_name`,
          max(write_max) AS max_write_max,
          max(select_max) as max_select_max,
          max(das_total_session_count) as max_das_total_session_count,
          max(`cpu_max`) as max_cpu_max,
          max(`iops_max`) as max_iops_max
        FROM
          `rds_monitor_data`
        where
          `exec_date` > date_sub(now(), interval 3 day)
        group by
          inst_id,
          inst_name
      ) a
    where
      a.max_write_max < 1
      and a.max_cpu_max < 1
      and a.max_iops_max < 1
      and a.max_das_total_session_count < 5
  ) b
  JOIN rds_info c ON b.inst_id = c.inst_id
where
  c.`status` = 1
order by
  b.max_write_max

Redis criteria

Metrics collected:

["CpuUsage", "memoryUsage", "TotalQps", "GetQps", "PutQps","intranetInRatio","intranetOutRatio","dasRedisTotalSession"]

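dasRedisTotalSession is obtained through the DAS interface das20200116_models.GetRedisAllSessionRequest.
The other metrics are obtained through the Redis monitoring interface DescribeHistoryMonitorValuesRequest.

Criteria for judging an Alibaba Cloud Redis instance unused:
Alibaba Cloud's own Redis monitoring generates some QPS, so "QPS equals 0" cannot be used to decide whether an instance is in use.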

a.max_put_qps_max < 2  -- the main criterion; this condition alone is sufficient
and a.max_get_qps_max < 25  -- auxiliary condition
and a.max_das_total_session_count < 30  -- auxiliary condition
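
To show how the main criterion and the auxiliary ones relate, here is a hedged Python sketch in which the auxiliary values are optional. The function name and the choice to skip auxiliary checks when a value is not supplied are assumptions made for illustration.

```python
def redis_looks_unused(max_put_qps, max_get_qps=None, max_sessions=None):
    """First-pass check for a Redis instance, using three-day maxima.

    max_put_qps is the primary signal; the other two are auxiliary and are
    only checked when a value is supplied.
    """
    if max_put_qps >= 2:
        return False
    if max_get_qps is not None and max_get_qps >= 25:
        return False
    if max_sessions is not None and max_sessions >= 30:
        return False
    return True
```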

ES criteria

Metrics collected:

 ["NodeCPUUtilization", "NodeDiskUtilization", "NodeLoad_1m", "NodeHeapMemoryUtilization", "ClusterIndexQPS", "ClusterQueryQPS", "elasticsearch-server.bulk_total_operations", "elasticsearch-server.search_total"]

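Metrics are obtained through the ES Grafana interface.
client.get_emon_monitor_data_with_options returns the read and write requests against each index over a time range.
Indices matching '.kibana|.report|.monitoring|.apm-|.security-' are excluded, and the indices with the most reads and the most writes are taken.
Metrics used, one for writes and one for reads:
"elasticsearch-server.bulk_total_operations", "elasticsearch-server.search_total"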
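
Excluding the built-in indices and picking the busiest remaining one might look like the sketch below. The input shape (a mapping from index name to request total) and the function name are assumptions; the prefix list comes from the exclusion pattern above, with the dots escaped and the match anchored to the start of the name.

```python
import re

# Built-in index prefixes taken from the exclusion list in the text.
SYSTEM_INDEX = re.compile(r"^\.(kibana|report|monitoring|apm-|security-)")


def busiest_user_index(totals):
    """Return the busiest non-system index name, or '' if none saw traffic.

    `totals` maps index name -> request count for one metric (hypothetical shape).
    """
    user = {
        name: count
        for name, count in totals.items()
        if count > 0 and not SYSTEM_INDEX.match(name)
    }
    if not user:
        return ""
    return max(user, key=user.get)
```

An empty string here corresponds to the "index name is empty" condition used in the ES criteria.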
Part of the request body:

# Doubled braces are literal JSON braces; single braces are format() placeholders.
request_body = """
{{
    "start": {pe_start_ts},
    "queries": [
        {{
            "metric": "{metric_key}",
            "aggregator": "sum",
            "downsample": "avg",
            "tags": {{
                "instanceId": "{instance_id}",
                "es_resourceUid": "1241148226163200",
                "index": "*"
            }},
            "granularity": "1m"
        }}
    ],
    "limit": "",
    "end": {pe_end_ts}
}}
""".format(
    pe_start_ts=pe_start_ts, pe_end_ts=pe_end_ts,
    instance_id=instance_id, metric_key=metric_key)
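
As an alternative to string formatting, the same body can be built as a dictionary and serialized with `json.dumps`, which avoids the doubled braces and escapes values safely. This is a sketch only: the function name and the `resource_uid` parameter are assumptions, while the field names are those of the request body above.

```python
import json


def build_emon_body(instance_id, metric_key, start_ts, end_ts, resource_uid):
    """Serialize the monitoring query for one metric of one ES instance."""
    body = {
        "start": start_ts,
        "queries": [
            {
                "metric": metric_key,
                "aggregator": "sum",
                "downsample": "avg",
                "tags": {
                    "instanceId": instance_id,
                    "es_resourceUid": resource_uid,
                    "index": "*",  # all indices; system ones are filtered afterwards
                },
                "granularity": "1m",
            }
        ],
        "limit": "",
        "end": end_ts,
    }
    return json.dumps(body)
```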

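Criteria:
index_read_max_name and index_bulk_write_max_name, the names of the most-read and most-written indices, are empty (after excluding the built-in ES indices '.kibana|.report|.monitoring|.apm-|.security-').
max_cluster_index_qps_max and max_cluster_query_qps_max are never 0 even when the cluster is unused, because the cluster's own monitoring accesses its built-in indices and generates requests.
These two metrics are included mainly as a safeguard: some Alibaba Cloud clusters were created so long ago that the API cannot return the
"elasticsearch-server.bulk_total_operations" and "elasticsearch-server.search_total" metrics for them.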

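a.max_cluster_index_qps_max < 20  -- auxiliary check
and a.max_cluster_query_qps_max < 5  -- auxiliary check
and a.index_read_max_name = ''  -- the most accurate signal when the API can return this value
and a.index_bulk_write_max_name = ''  -- the most accurate signal when the API can return this value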

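HBase criteria
Metrics are obtained through the CMS monitoring API, cms_20190101_models.DescribeMetricListRequest:
"cpu_idle", "write_ops", "read_ops", "storage_used_percent"

Criteria: instances whose maximum write QPS and maximum read QPS over the last three days are both below 10 (a margin that absorbs monitoring noise) pass the first screening and are then reviewed manually.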

a.max_write_ops_max < 10
and a.max_read_ops_max < 10

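ClickHouse criteria
Metrics are obtained through the CMS monitoring API, cms_20190101_models.DescribeMetricListRequest:

["clickhouse_qps_cc", "clickhouse_query_cc", "clickhouse_tps_cc",
"clickhouse_insert_rows_cc", "clickhouse_conn_usage_count_cc",
"clickhouse_disk_utilization_cc", "clickhouse_cpu_utilization_cc"]

Criteria: instances whose monitoring-metric maxima over the last three days are below 10 (a margin that absorbs monitoring noise) pass the first screening and are then reviewed manually.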

a.max_clickhouse_qps_cc_max < 10
and a.max_clickhouse_query_cc_max < 10
and a.max_clickhouse_tps_cc_max < 10
and a.max_clickhouse_insert_rows_cc_max < 10

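ADB criteria

Metrics are obtained through the CMS monitoring API, cms_20190101_models.DescribeMetricListRequest:

["tps", "qps", "connections", "worker_max_cpu_used", "worker_max_node_disk_used_percent"]

Criteria: instances whose monitoring-metric maxima over the last three days are below 5 (a margin that absorbs monitoring noise) pass the first screening and are then reviewed manually.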
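
Since the HBase, ClickHouse, and ADB rules all have the form "every listed maximum is below a threshold", they can be kept in one table and evaluated by a single function. The sketch below is an illustration under assumptions: the HBase and ClickHouse entries mirror the conditions given earlier, but the ADB column names and the choice of tps and qps as the metrics compared against 5 are guesses, because the text only states the threshold.

```python
# Per-product limits on the three-day maxima.
THRESHOLDS = {
    "hbase": {"max_write_ops_max": 10, "max_read_ops_max": 10},
    "clickhouse": {
        "max_clickhouse_qps_cc_max": 10,
        "max_clickhouse_query_cc_max": 10,
        "max_clickhouse_tps_cc_max": 10,
        "max_clickhouse_insert_rows_cc_max": 10,
    },
    # Assumption: hypothetical column names; the article does not say which
    # ADB metrics the "below 5" rule applies to.
    "adb": {"max_tps_max": 5, "max_qps_max": 5},
}


def first_pass_candidate(product, maxima):
    """True if every configured metric is present and below its limit."""
    limits = THRESHOLDS[product]
    return all(
        name in maxima and maxima[name] < limit
        for name, limit in limits.items()
    )
```

A missing metric counts as "not below the limit", so an instance with incomplete monitoring data is never flagged automatically; flagged instances still need manual review.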