High-IO alerts on machines running Spark on YARN

顧棟

Article link: https://blog.csdn.net/weixin_43820556/article/details/147138373

Background

The goal is to quickly identify, and raise early warnings for, the jobs that cause IO problems.

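IO alert categories

Cluster-wide IO alerts

Multiple machines in the cluster (more than 10?) raise IO alerts within the same time window, and further machines follow one after another, showing an avalanche trend. The offending jobs must be identified and cleaned up immediately.

Single-machine IO alerts

IO alerts occur on a fixed set of a few machines, so the blast radius is controllable. If the alerts persist for a long time, manual intervention is still required to keep the problem from escalating into an avalanche.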

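Main flow

Approach 1: in the Spark metrics collection and diagnosis job, add a processing branch for APPLICATION_WARN tasks that performs the IO diagnosis.

Approach 2: query the Doris table metric_events_spark, filtering on the record insertion time (time). Each query covers the appids added within the last 6 h and runs once every 30 min. Should a query interface for suspicious apps also be provided?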
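With a 6 h lookback polled every 30 min, consecutive query windows overlap heavily, so the same appid can come back in up to 12 polls. The following is a minimal sketch of how a poller might report each appid only once per lookback period. The names `query_window`, `SeenApps`, and `new_appids` are hypothetical and are not part of the original system.

```python
POLL_INTERVAL_S = 30 * 60   # poll every 30 minutes (from the text)
LOOKBACK_S = 6 * 3600       # each poll looks back 6 hours (from the text)


def query_window(now_s):
    """Return the (start, end) of the lookback window in epoch seconds."""
    return now_s - LOOKBACK_S, now_s


class SeenApps:
    """Hypothetical helper: remembers appids already reported."""

    def __init__(self):
        self._last_seen = {}  # appid -> epoch seconds of the last poll that returned it

    def new_appids(self, appids, now_s):
        """Return appids not reported within the lookback period, in input order."""
        cutoff = now_s - LOOKBACK_S
        # Forget appids that have dropped out of the lookback window.
        self._last_seen = {a: t for a, t in self._last_seen.items() if t >= cutoff}
        # dict.fromkeys removes duplicates while preserving order.
        fresh = [a for a in dict.fromkeys(appids) if a not in self._last_seen]
        for a in appids:
            self._last_seen[a] = now_s
        return fresh


seen = SeenApps()
first = seen.new_appids(["app_1", "app_2"], now_s=0)
second = seen.new_appids(["app_1", "app_2", "app_3"], now_s=POLL_INTERVAL_S)
print(first, second)
```

Keeping the state in memory is only an assumption made for illustration; a real poller would likely persist it, for example in the yarn_app_io table described below.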

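Fields of the yarn_app_io table (data is retained for only 7 days)

id: primary key
appid: corresponds to the application_id
~~app_starttime: corresponds to the application's start time~~
~~app_name: corresponds to the application's appName~~
~~app_user: corresponds to the application's user~~
~~service: service component, e.g. spark~~

The IO-related metrics currently available in the Spark metrics are shown below. Metrics named *Size are in bytes.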

{
    ...
    "startTime": 1743906629181,
    "endTime": 1743934441762,
    "inOut": {
      "inputSize": 7212568933776,
      "inputRecords": 192354241079,
      "outputSize": 0,
      "outputRecords": 0,
      "shuffleReadSize": 0,
      "shuffleReadRecords": 0,
      "shuffleWriteSize": 69464878627819,
      "shuffleWriteRecords": 192354241078,
      "memorySpilledSize": 294483291799552,
      "diskSpilledSize": 67844450753852
    }
    ...
}

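Diagnostic rules

If shuffleWriteSize or diskSpilledSize exceeds 50 TB and the runtime exceeds 30 min, raise an IO alert with level "severe".
If shuffleWriteSize or diskSpilledSize is greater than 40 TB and at most 50 TB, and the runtime exceeds 30 min, raise an IO alert with level "warning".
If shuffleWriteSize or diskSpilledSize is greater than 10 TB and at most 40 TB, and the runtime exceeds 60 min, raise an IO alert with level "normal" (level still to be confirmed).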
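The three tiers can be sketched as a small classifier. This is an illustration only, assuming that every threshold is in TB of 1024^4 bytes (the factor used in the SQL further down) and that "shuffleWriteSize or diskSpilledSize" means the larger of the two is compared. The function name `classify_io_risk` and the level strings are hypothetical.

```python
TIB = 1024 ** 4  # assumption: 1 TB = 1024^4 bytes, as in the SQL thresholds


def classify_io_risk(shuffle_write_size, disk_spilled_size, start_ms, end_ms):
    """Return 'severe', 'warning', 'normal', or None for one application.

    Sizes are in bytes; start_ms and end_ms are epoch milliseconds.
    """
    size_tib = max(shuffle_write_size, disk_spilled_size) / TIB
    minutes = (end_ms - start_ms) / 1000 / 60
    if size_tib > 50 and minutes > 30:
        return "severe"
    if 40 < size_tib <= 50 and minutes > 30:
        return "warning"
    if 10 < size_tib <= 40 and minutes > 60:
        return "normal"
    return None


# Values taken from the sample metrics JSON above.
level = classify_io_risk(
    shuffle_write_size=69464878627819,
    disk_spilled_size=67844450753852,
    start_ms=1743906629181,
    end_ms=1743934441762,
)
print(level)
```

The sample application wrote roughly 63 TB of shuffle data over about 7.7 hours, so it falls into the highest tier.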

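Statistics method

Computed by total shuffleWriteSize

Distribution
- omitted

Computed by hourly shuffleWriteSize

Distribution
- omitted

SQL

Filter jobs that may trigger IO alerts (takes 22 s+ on about 3 million rows)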

-- By total volume: shuffleWriteSize above 10 TB (10995116277760 = 10 * 1024^4 bytes),
-- runtime above 1 hour, and record added within the last 6 hours.
SELECT
  appid,
  from_unixtime(json_extract(description, '$.startTime') / 1000) AS startTime,
  from_unixtime(json_extract(description, '$.endTime') / 1000) AS endTime,
  json_extract(description, '$.summary.inOut.shuffleWriteSize') AS shuffleWriteSize,
  json_extract(description, '$.summary.inOut.memorySpilledSize') AS memorySpilledSize,
  json_extract(description, '$.summary.inOut.diskSpilledSize') AS diskSpilledSize
FROM
  metric_events_spark
WHERE
  json_extract(description, '$.summary.inOut.shuffleWriteSize') / 10995116277760 > 1
  AND (
    json_extract(description, '$.endTime') - json_extract(description, '$.startTime')
  ) / 1000 > 3600
  AND time / 1000 >= UNIX_TIMESTAMP() - 3600 * 6
ORDER BY
  time DESC;

-- By hourly volume: average shuffle write per hour, in TB (1099511627776 = 1024^4 bytes).
-- The threshold as written (0.0000000001 TB/h) is tiny, so this condition matches
-- practically any job with non-zero shuffle write.
SELECT
  appid,
  from_unixtime(json_extract(description, '$.startTime') / 1000) AS startTime,
  from_unixtime(json_extract(description, '$.endTime') / 1000) AS endTime,
  json_extract(description, '$.summary.inOut.shuffleWriteSize') AS shuffleWriteSize,
  json_extract(description, '$.summary.inOut.memorySpilledSize') AS memorySpilledSize,
  json_extract(description, '$.summary.inOut.diskSpilledSize') AS diskSpilledSize
FROM
  metric_events_spark
WHERE
  json_extract(description, '$.summary.inOut.shuffleWriteSize') / (
    (
      json_extract(description, '$.endTime') - json_extract(description, '$.startTime')
    ) / 1000 / 3600
  ) / 1099511627776 > 0.0000000001
  AND (
    json_extract(description, '$.endTime') - json_extract(description, '$.startTime')
  ) / 1000 > 3600
  AND time / 1000 >= UNIX_TIMESTAMP() - 3600 * 6
ORDER BY
  time DESC;
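To make the hourly figure in the second query concrete, the same arithmetic can be reproduced outside the database. This is a minimal sketch using the values from the sample metrics JSON; the function name `shuffle_write_tb_per_hour` is hypothetical.

```python
TIB = 1024 ** 4  # 1099511627776, the divisor used in the query


def shuffle_write_tb_per_hour(shuffle_write_size, start_ms, end_ms):
    """Average shuffle write per hour of runtime, in TB of 1024^4 bytes."""
    hours = (end_ms - start_ms) / 1000 / 3600
    if hours <= 0:
        return 0.0  # avoid dividing by zero for jobs with no measurable runtime
    return shuffle_write_size / hours / TIB


rate = shuffle_write_tb_per_hour(
    shuffle_write_size=69464878627819,
    start_ms=1743906629181,
    end_ms=1743934441762,
)
print(f"{rate:.2f} TB/h")
```

For the sample application this gives a little over 8 TB per hour, which suggests that a usable hourly threshold would be many orders of magnitude above the value currently in the query.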

Query the machines used by a given appid

select
  NODEHOSTNAME
from
  yarn_rm_audit
where
  appid = 'application_1732696647414_47711634'
  and NODEHOSTNAME is not null
group by
  NODEHOSTNAME
order by
  NODEHOSTNAME

Query the IO-related metrics of a given appid

-- Returns the whole inOut object, so the column is aliased accordingly.
SELECT
  appid,
  json_extract(description, '$.summary.inOut') AS inOut
FROM
  metric_events_spark
WHERE
  appid = 'application_1732696647414_47711634';

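Sample alert messages

Single event

YARN service IO alert
Alert time: 2025-04-16 14:37:50
Environment: sit
Level: warning
Event ID: yarn-io-warn
Event count: 1
Alert content:
  [2025-04-16 14:37:42] 3 jobs have excessive read/write volume and may trigger IO anomalies. Job list:
application_1729236716087_248111
application_1729236716087_248107
...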

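Multiple events

YARN service IO alert
Alert time: 2025-04-16 15:07:50
Environment: sit
Level: warning
Event ID: yarn-io-warn
Event count: 3
Alert content:
  Latest: [2025-04-16 15:00:00] 5 jobs have excessive read/write volume and may trigger IO anomalies. Job list:
application_1729236716087_248111
application_1729236716087_248107
application_1729236716087_248112
...
  Earliest: [2025-04-16 14:40:00] 3 jobs have excessive read/write volume and may trigger IO anomalies. Job list:
application_1729236716087_248111
application_1729236716087_248107
...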

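Open issues

An IO alert fires, but the diagnostic rules match no job. In that case the state of the running jobs and their detailed metrics need to be recorded.
Jobs matched by the diagnostic rules do not trigger an IO alert. (Are there too few reference metrics? What is the false-positive rate?)

Is this related to cluster size and machine specifications? To job density?

Alert messages
- Should every application_id be displayed?
- Once an application_id starts causing problems, the event keeps emitting the same message
  - Should events be aggregated by yarn-service-io or by application_id?
  - Alert messages sent to the 华佗 web may surge. Is the resulting impact on MySQL performance and storage acceptable?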

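Other

How can the files behind a high-IO alert be found?

See the article 【ES实战】ES集群机器磁盘IO过高告警分析 (an analysis of high disk-IO alerts on ES cluster machines). The same approach applies here: the files involved lead back to the corresponding application id.

Optimization directions

Adopt a remote shuffle service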

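Impact of high disk IO

Performance degradation

When disk IO is high, system response time increases and processing slows down, which affects business use.

Reduced system stability

Sustained excessive disk IO can crash the system and interrupt application services, disrupting normal business use.

Increased resource contention

High disk IO intensifies contention for CPU and memory, raises the overall server load, and makes the server less stable.

Accelerated hardware wear

It accelerates hardware wear, shortens equipment service life, and raises the hardware failure rate.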

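Core disk IO metrics

Data transfer rate (throughput)

This depends on the disk interface type; different interface types have different transfer bandwidths. For example, SATA 3.0 has a transfer bandwidth of 6 Gb/s.

Response time

The time from issuing a read or write request until that request returns.

IOPS

The number of input/output operations per second
- Total IOPS: disk IOPS under a mixed read/write, sequential, and random I/O load
- Random Read IOPS: IOPS under 100% random reads
- Random Write IOPS: IOPS under 100% random writes
- Sequential Write IOPS: IOPS under 100% sequential writes
- Sequential Read IOPS: IOPS under 100% sequential reads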
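To relate the throughput figure to the shuffle volumes discussed earlier, the sketch below estimates how long a single disk would need to absorb the sample job's shuffle write. It assumes the SATA 3.0 interface ceiling: a 6 Gb/s line rate with 8b/10b encoding, which leaves about 600 MB/s of payload bandwidth. Real disks, especially spinning ones, sustain considerably less, so this is a lower bound on the time. The function name `hours_to_write` is hypothetical.

```python
SATA3_LINE_RATE_BITS_PER_S = 6_000_000_000  # 6 Gb/s line rate

# 8b/10b encoding carries 8 payload bits per 10 line bits; divide by 8 for bytes.
USABLE_BYTES_PER_S = SATA3_LINE_RATE_BITS_PER_S * 8 // 10 // 8  # 600 MB/s


def hours_to_write(num_bytes, bytes_per_s=USABLE_BYTES_PER_S):
    """Hours needed to write num_bytes sequentially at the given rate."""
    return num_bytes / bytes_per_s / 3600


# shuffleWriteSize from the sample metrics JSON
single_disk_hours = hours_to_write(69464878627819)
print(f"{single_disk_hours:.1f} h on one disk at the interface limit")
```

Under these assumptions a single disk would need more than 32 hours for the sample job's shuffle write alone, while the job itself ran for under 8 hours, so the writes must have been spread across many disks and nodes. That is consistent with the cluster-wide alert pattern described at the beginning.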