1. YARN Failures
Even well-built software fails. YARN is designed to minimize downtime when a component fails, not to prevent component failures.
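2. YARN Failure Monitoring Communication
The figure below shows how YARN components communicate during failure monitoring to confirm that each one is still alive. When a failure occurs, each component has its own restart mechanism.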
3. Modifying Failure Detection Behavior in Ambari
4. ResourceManager Check Settings
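To verify that each component is alive, components are checked periodically, and any component found to have failed is handled.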
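The periodic check described above can be pictured as a heartbeat tracker with an expiry interval. The following is a minimal sketch only, not YARN's implementation: the class `LivenessMonitor`, its method names, and the injected `clock` are hypothetical names chosen for illustration. In a real cluster the corresponding timeout is set through properties such as `yarn.nm.liveness-monitor.expiry-interval-ms`.

```python
import time


class LivenessMonitor:
    """Toy heartbeat tracker (hypothetical; not YARN's actual code)."""

    def __init__(self, expiry_interval_s, clock=time.monotonic):
        self.expiry_interval_s = expiry_interval_s
        self.clock = clock  # injectable so the sketch can be tested without sleeping
        self.last_seen = {}  # component id -> time of its last heartbeat

    def heartbeat(self, component_id):
        """Record that a component reported in just now."""
        self.last_seen[component_id] = self.clock()

    def expire(self):
        """Remove and return the components whose last heartbeat is too old."""
        now = self.clock()
        expired = sorted(
            cid
            for cid, seen in self.last_seen.items()
            if now - seen > self.expiry_interval_s
        )
        for cid in expired:
            del self.last_seen[cid]
        return expired
```

A caller would invoke `heartbeat()` whenever a component reports in and run `expire()` on a timer; whatever `expire()` returns is the set of components to treat as failed and hand to the recovery logic.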
5. NodeManager Check Settings
6. Container / Task and ApplicationMaster Recovery
7. NodeManager and ResourceManager Recovery
8. YARN Work-Preserving Restarts
Configuration related to YARN Work-Preserving Restarts:
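The original configuration listing is not reproduced here. As a rough sketch, the properties usually involved can be checked with a small Python helper. The property names are standard `yarn-site.xml` keys, but the helper `check_work_preserving`, the dictionary `EXAMPLE_CONF`, and all values shown (especially the recovery directory and the port) are illustrative assumptions, not the settings from the original.

```python
# Illustrative yarn-site.xml values as a plain dict (assumed, not from the source).
EXAMPLE_CONF = {
    "yarn.resourcemanager.recovery.enabled": "true",
    "yarn.resourcemanager.work-preserving-recovery.enabled": "true",
    "yarn.resourcemanager.store.class":
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore",
    "yarn.nodemanager.recovery.enabled": "true",
    "yarn.nodemanager.recovery.dir": "/var/log/hadoop-yarn/nodemanager/recovery-state",
    "yarn.nodemanager.address": "0.0.0.0:45454",
}


def check_work_preserving(conf):
    """Return a list of problems that would block a work-preserving restart."""
    problems = []
    for key in (
        "yarn.resourcemanager.recovery.enabled",
        "yarn.resourcemanager.work-preserving-recovery.enabled",
        "yarn.nodemanager.recovery.enabled",
    ):
        if conf.get(key, "false").lower() != "true":
            problems.append(f"{key} should be true")
    for key in ("yarn.resourcemanager.store.class", "yarn.nodemanager.recovery.dir"):
        if not conf.get(key):
            problems.append(f"{key} is not set")
    # A restarted NodeManager must come back on the same address, so the
    # port has to be fixed rather than ephemeral (port 0).
    port = conf.get("yarn.nodemanager.address", "0.0.0.0:0").rsplit(":", 1)[-1]
    if port == "0":
        problems.append("yarn.nodemanager.address must use a fixed port")
    return problems
```

Calling `check_work_preserving(EXAMPLE_CONF)` returns an empty list; switching the NodeManager address to an ephemeral port produces a single reported problem.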
9. YARN Log Aggregation
- Enabled by default in HDP 2.3
- Enables long-term storage of NodeManager logs by storing them in a central location in HDFS
- Avoids the need to truncate logs in order to conserve space on a local file system
- Provides the ability to centrally view log files via a single web UI (the Job History Server)
Default configuration for YARN Log Aggregation:
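The original default values are not reproduced here. The sketch below lists the properties that typically control log aggregation and shows where an application's aggregated logs would land in HDFS. The values in `LOG_AGGREGATION_CONF` are placeholders rather than confirmed HDP 2.3 defaults, the helper `aggregated_log_dir` is hypothetical, and the directory layout is the one assumed for Hadoop 2.x (newer releases may add further path components).

```python
import posixpath

# Placeholder values (assumed); check the cluster's yarn-site.xml for real ones.
LOG_AGGREGATION_CONF = {
    "yarn.log-aggregation-enable": "true",
    "yarn.nodemanager.remote-app-log-dir": "/app-logs",
    "yarn.nodemanager.remote-app-log-dir-suffix": "logs",
    "yarn.log-aggregation.retain-seconds": "2592000",  # 30 days
}


def aggregated_log_dir(conf, user, application_id):
    """Build the HDFS directory holding one application's aggregated logs.

    Assumed Hadoop 2.x layout: <remote-app-log-dir>/<user>/<suffix>/<app id>
    """
    return posixpath.join(
        conf["yarn.nodemanager.remote-app-log-dir"],
        user,
        conf["yarn.nodemanager.remote-app-log-dir-suffix"],
        application_id,
    )


def retention_days(conf):
    """Convert the retention period from seconds to whole days."""
    return int(conf["yarn.log-aggregation.retain-seconds"]) // 86400
```

With these placeholder values, logs for a user `alice` would sit under `/app-logs/alice/logs/<application id>`. In practice the logs are usually read with `yarn logs -applicationId <application id>` rather than by browsing HDFS directly.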