负载过高问题快速排查思路教程二

feidodo网络

已于 2022-11-03 11:48:24 修改

阅读量219

点赞数

文章标签：数据库 sql java

于 2022-11-03 11:44:57 首次发布

本文链接：https://blog.csdn.net/feidododekefu/article/details/127667293

版权

由于之前查询发现有部分 left JOIN 的查询，故查询了 “异常容器” 的 left JOIN 的查询，，随后将读取条数为 0 的排除掉。

select * from (select LEFT(query, 100) as sql, count() as quneryNum, sum(query_duration_ms) as totalTime, totalTime/queryNum as avgTime from system.query_log ql where sql like '%异常容器%' and read_rows != 0 and event_time > toDateTime('2022-09-23 12:00:00') and event_time < toDateTime('2022-09-23 17:00:00') and query not like '%INSERT INTO%' group by sql order by queryNum desc)

根据小时 聚合 每个小时 / 分钟段查询次数耗时

--按照每小时聚合
select toHour(event_time) as t, count() as queryNum, sum(query_duration_ms) as totalTime, totalTime/queryNum as avgTime from system.query_log ql where event_time > toDateTime('2022-09-23 08:00:00') and event_time < toDateTime('2022-09-23 17:00:00') and query not like '%INSERT INTO%' and query like '%异常容器%' and read_rows != 0 group by t limit 50
--按照每分钟聚合
select toMinute(event_time) as t, count() as queryNum, sum(query_duration_ms) as totalTime, totalTime/queryNum as avgTime from system.query_log ql where event_time > toDateTime('2022-09-23 12:00:00') and event_time < toDateTime('2022-09-23 13:00:00') and query not like '%INSERT INTO%' and query like '%异常容器%' and read_rows != 0 group by t limit 50

查询结果如下：10 点开始有，12 点查询响应时间翻了 10 倍。

从 32s 开始，响应时间变为 24s 多，后面的时间也逐渐变长。

查询 left JOIN 的查询个数。

select * from (select LEFT(query, 100) as sql, count() as quneryNum, sum(query_duration_ms) as totalTime, totalTime/queryNum as avgTime from system.query_log ql where query like '% JOIN%' and read_rows != 0 and event_time > toDateTime('2022-09-23 12:00:00') and event_time < toDateTime('2022-09-23 21:00:00') and query not like '%INSERT INTO%' group by sql order by queryNum desc)

发现有问题的表时，查询该表结构。

show create table "shard_2"."cw_db_zabbix_metric_data01_replica"

经查询发现该表创建的比较差，order by 仅有一个 UUID。且该表 JOIN 的 SQL 也比较多，每次查询的读取的条数也特别大。

停掉 SQL ，观察结果

通过观察，发现有不少类似的 JOIN 的 SQL。当停掉一个异常容器指标的 SQL 时会发现 load 值有些许变化，偶尔会降到 700 多。由于指标管理都是通过 gatewayAPI 调用，停掉 SQL 最快的方法便是直接将 gatewayAPI 停掉，这样只会影响指标 SQL 调，不会影响其他功能使用。将 gatewayAPI 停掉后就不会出现指标调用 SQL 也全部停掉，此时发现 load 值已降低

总结

遇到此类问题可先查看日志，首先在（Clickhouse 日志 Zookeeper 日志）日志中看能否找到有用的信息，例如直接报错信息等，如果在日志中找不到太多有用的信息的话，可以从下面入手。

一般遇到 load 值比较高的情况时，基本上都是发生在查询上面。当遇到这种问题时可先查询带有 JOIN 的 SQL 语句是不是很多。随后通过 query_log 表中的一些字段去查询重要信息，通过对字段使用一些函数来获取有用的信息，如获取查询的 SQL，SQL 执行次数、执行时间等。

本次排查过程主要使用 query_log 表，下面为重要字段：

event_time — 查询开始时间.

query_duration_ms — 查询消耗的时间（毫秒）.

read_rows— 从参与了查询的所有表和表函数读取的总行数.

query — 查询语句.

Clickhouse query_log 表中所有字段

type (Enum8) — 执行查询时的事件类型。值:
- 'QueryStart' = 1 — 查询成功启动.
- 'QueryFinish' = 2 — 查询成功完成.
- 'ExceptionBeforeStart' = 3 — 查询执行前有异常.
- 'ExceptionWhileProcessing' = 4 — 查询执行期间有异常.
event_date (Date) — 查询开始日期.
event_time (DateTime) — 查询开始时间.
event_time_microseconds (DateTime64) — 查询开始时间（毫秒精度）.
query_start_time (DateTime) — 查询执行的开始时间.
query_start_time_microseconds (DateTime64) — 查询执行的开始时间（毫秒精度）.
query_duration_ms (UInt64) — 查询消耗的时间（毫秒）.
read_rows (UInt64) — 从参与了查询的所有表和表函数读取的总行数。包括：普通的子查询，IN 和 **JOIN** 的子查询。对于分布式查询 read_rows 包括在所有副本上读取的行总数。每个副本发送它的 read_rows 值，并且查询的服务器 - 发起方汇总所有接收到的和本地的值。缓存卷不会影响此值。
read_bytes (UInt64) — 从参与了查询的所有表和表函数读取的总字节数。包括：普通的子查询，IN 和 **JOIN** 的子查询。对于分布式查询 read_bytes 包括在所有副本上读取的字节总数。每个副本发送它的 read_bytes 值，并且查询的服务器 - 发起方汇总所有接收到的和本地的值。缓存卷不会影响此值。
written_rows (UInt64) — 对于 INSERT 查询，为写入的行数。对于其他查询，值为 0。
written_bytes (UInt64) — 对于 INSERT 查询时，为写入的字节数。对于其他查询，值为 0。
result_rows (UInt64) — SELECT 查询结果的行数，或 **INSERT** 的行数。
result_bytes (UInt64) — 存储查询结果的 RAM 量.
memory_usage (UInt64) — 查询使用的内存.
query (String) — 查询语句.
exception (String) — 异常信息.
exception_code (Int32) — 异常码.
stack_trace (String) — Stack Trace. 如果查询成功完成，则为空字符串。
is_initial_query (UInt8) — 查询类型。可能的值:
- 1 — 客户端发起的查询.
- 0 — 由另一个查询发起的，作为分布式查询的一部分.
user (String) — 发起查询的用户.
query_id (String) — 查询 ID.
address (IPv6) — 发起查询的客户端 IP 地址.
port (UInt16) — 发起查询的客户端端口.
initial_user (String) — 初始查询的用户名（用于分布式查询执行）.
initial_query_id (String) — 运行初始查询的 ID（用于分布式查询执行）.
initial_address (IPv6) — 运行父查询的 IP 地址.
initial_port (UInt16) — 发起父查询的客户端端口.
interface (UInt8) — 发起查询的接口。可能的值:
- 1 — TCP.
- 2 — HTTP.
os_user (String) — 运行 clickhouse-client 的操作系统用户名.
client_hostname (String) — 运行 clickhouse-client 或其他 TCP 客户端的机器的主机名。
client_name (String) — clickhouse-client 或其他 TCP 客户端的名称。
client_revision (UInt32) — clickhouse-client 或其他 TCP 客户端的 Revision。
client_version_major (UInt32) — clickhouse-client 或其他 TCP 客户端的 Major version。
client_version_minor (UInt32) — clickhouse-client 或其他 TCP 客户端的 Minor version。
client_version_patch (UInt32) — clickhouse-client 或其他 TCP 客户端的 Patch component。
http_method (UInt8) — 发起查询的 HTTP 方法。可能值:
- 0 — TCP 接口的查询.
- 1 — GET
- 2 — POST
http_user_agent (String) — The UserAgent The UserAgent header passed in the HTTP request。
quota_key (String) — 在 quotas 配置里设置的 “quota key” （见 keyed).
revision (UInt32) — ClickHouse revision.
ProfileEvents (Map(String, UInt64))) — Counters that measure different metrics. The description of them could be found in the table 系统。活动
Settings (Map(String, String)) — Names of settings that were changed when the client ran the query. To enable logging changes to settings, set the log_query_settings 参数为 1。
thread_ids (Array(UInt64)) — 参与查询的线程数.
Settings.Names (Array（String)) — 客户端运行查询时更改的设置的名称。要启用对设置的日志记录更改，请将 log_query_settings 参数设置为 1。
Settings.Values (Array（String)) — Settings.Names 列中列出的设置的值