问题现象
在集群中进行Hive-On-Spark查询失败,并在HiveServer2日志中显示如下错误:
[atguigu@hadoop102 bin]$ ods_to_dwd_log.sh 2020-06-15
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hbase-2.0.5/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 823e4c38-ee83-499a-8612-3e07b995cb3a
Logging initialized using configuration in jar:file:/opt/module/hive/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive Session ID = 4572b5bc-16d7-42e9-b3e0-a46deb6d1ba1
OK
Time taken: 0.736 seconds
FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark client for Spark session ed4ecba2-fbcd-4fed-a2a7-b6ab378b23cf
原因分析:
当Hive服务将Spark应用程序提交到集群时,在Hive Client会记录提交应用程序的等待时间,通过等待时长确定Spark作业是否在集群上运行。如果应用程序未在指定的等待时间范围内运行,则Hive服务会认为Spark应用程序已失败。
当Spark ApplicationMaster被分配了Yarn Container并且正在节点上运行时,则Hive认为Spark应用程序是成功运行的。如果Spark作业被提交到Yarn的排队队列并且正在排队,在Yarn为Spark作业分配到资源并且正在运行前(超过Hive的等待时长)则Hive服务可能会终止该查询并提示“Failed to create spark client”。
解决方案:
1.可以通过调整Hive On Spark超时值,通过设置更长的超时时间,允许Hive等待更长的时间以确保在集群上运行Spark作业,在执行查询前设置如下参数
set hive.spark.client.server.connect.timeout=300000;
2.该参数单位为毫秒,默认值为90秒。要验证配置是否生效,可以通过查看HiveServer2日志中查询失败异常日志确定:
总结:
1.当集群资源使用率过高时可能会导致Hive On Spark查询失败,因为Yarn无法启动Spark Client。
2.Hive在将Spark作业提交到集群是,默认会记录提交作业的等待时间,如果超过设置的hive.spark.client.server.connect.timeout的等待时间则会认为Spark作业启动失败,从而终止该查询。