Production Pitfalls Series: A Hive on Spark Connection Timeout Problem

This post documents a Hive on Spark connection-timeout problem hit in production, analyzed from several angles: the alert, the cluster state, the application, and the open-source components involved. The investigation found that the Spark Driver thread timed out while connecting back to HiveServer2; the default timeout for that connection is 1s, which is far too short. The fix was to increase the timeout on that connection. The post also gives a brief overview of the Hive on Spark architecture and how it works.
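As a hedged sketch of the fix described above (the parameter names come from Hive's Spark-client configuration; verify the defaults against your own Hive 2.1.1 build, and the 30s/300s values here are illustrative, not the author's exact settings), the timeout can be raised per session or in hive-site.xml:

```sql
-- Timeout for the remote Spark driver connecting back to the Hive client;
-- defaults to 1000ms, which is the 1s limit described above.
SET hive.spark.client.connect.timeout=30000ms;

-- Handshake timeout between the Hive client and the remote driver
-- (default 90000ms); raising it as well gives a loaded cluster more headroom.
SET hive.spark.client.server.connect.timeout=300000ms;
```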

How It Started

In the early hours of July 16, a DingTalk alert came in: the ETL job for the company-wide organizational-structure table had failed with an exception while pushing data to the DIM layer.

This table feeds the base organizational data used the next day by every application service the big-data team exposes. Of course, our Plan B is no pushover: on failure, the permission-management module and the gateway automatically switch over to the last copy of the organizational data that was successfully loaded into DIM the previous day. Only colleagues whose org-chart position changed that day are affected on the system, and that number is controlled. We also keep audit records of every organizational change, so if the ETL cannot be repaired by the next day, we can manually replay those changes from the audit data for the affected users and keep the online system stable.

Technical Architecture

  • Cluster: 18 CDH physical compute nodes, 256 GB RAM / 64 cores each
  • Scheduler: DolphinScheduler
  • Data extraction: DataX
  • DIM-layer database: Doris
  • Hive version: 2.1.1

Alerting

The current alerting strategy is to have a bot capture DolphinScheduler's alert emails and post them to a DingTalk group. DolphinScheduler can in fact surface failures itself, but that requires a fair amount of development, and we worried that in a complex scheduling pipeline some task might slip past the monitoring and an alert would be lost entirely, which would be a serious problem. So we went simple and blunt: the bot reads the mailbox in place of a human and forwards alerts to DingTalk, and all we have to watch is this little messenger of happiness knocking on the door.
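As a hedged sketch of that forwarding step (the webhook URL follows DingTalk's public custom-robot API, but the token, function names, and message format here are assumptions for illustration, not the author's actual bot):

```python
import json
import urllib.request

# Assumed DingTalk custom-robot webhook; the access token is a placeholder.
WEBHOOK = "https://oapi.dingtalk.com/robot/send?access_token=<your-token>"

def build_alert_payload(subject: str, body: str) -> dict:
    """Wrap an alert email's subject and body as a DingTalk text message."""
    return {
        "msgtype": "text",
        "text": {"content": f"[Dolphin alert] {subject}\n{body}"},
    }

def send_alert(subject: str, body: str) -> None:
    """POST the alert to the DingTalk group robot."""
    data = json.dumps(build_alert_payload(subject, body)).encode("utf-8")
    req = urllib.request.Request(
        WEBHOOK, data=data, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=5)
```

In practice the bot would poll the mailbox (e.g. over IMAP) for new DolphinScheduler failure emails and call `send_alert` for each one it finds.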

[Screenshot: the DingTalk alert message]

Cluster Logs

Log Type: stderr

Log Upload Time: Fri Jul 16 01:27:46 +0800 2021

Log Length: 10569

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data7/yarn/nm/usercache/dolphinscheduler/filecache/8096/__spark_libs__6065796770539359217.zip/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/07/16 01:27:43 INFO util.SignalUtils: Registered signal handler for TERM
21/07/16 01:27:43 INFO util.SignalUtils: Registered signal handler for HUP
21/07/16 01:27:43 INFO util.SignalUtils: Registered signal handler for INT
21/07/16 01:27:43 INFO spark.SecurityManager: Changing view acls to: yarn,dolphinscheduler
21/07/16 01:27:43 INFO spark.SecurityManager: Changing modify acls to: yarn,dolphinscheduler
21/07/16 01:27:43 INFO spark.SecurityManager: Changing view acls groups to: 
21/07/16 01:27:43 INFO spark.SecurityManager: Changing modify acls groups to: 
21/07/16 01:27:43 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, dolphinscheduler); groups with view permissions: Set(); users  with modify permissions: Set(yarn, dolphinscheduler); groups with modify permissions: Set()
21/07/16 01:27:43 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1625364172078_3093_000001
21/07/16 01:27:43 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
21/07/16 01:27:43 INFO yarn.ApplicationMaster: Waiting for spark context initialization...
21/07/16 01:27:43 INFO client.RemoteDriver: Connecting to HiveServer2 address: hadoop-task-1.bigdata.xx.com:24173
21/07/16 01:27:44 INFO conf.HiveConf: Found configuration file file:/data8/yarn/nm/usercache/dolphinscheduler/filecache/8097/__spark_conf__.zip/__hadoop_conf__/hive-site.xml
21/07/16 01:27:44 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.XX.com/10.25.15.104:24173
	at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:41)
	at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:155)
	at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:559)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:673)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
	... 10 more
21/07/16 01:27:44 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User class threw exception: java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
	at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:41)
	at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:155)
	at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:559)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:673)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException:<