Sparkthrift Server 启动命令调优及问题报错解决

1、sparkthrift Server 启动命令

因为是测试命令,所以你需要和正式服务进行区别,不改变节点的情况下需要改变服务名称和服务端口。

sbin/start-thriftserver.sh \
--name spark_sql_thriftserver2 \
--master yarn --deploy-mode client \
--driver-cores 4 --driver-memory 8g \
--executor-cores 4 --executor-memory 6g \
--conf spark.scheduler.mode=FAIR \
--conf spark.hadoop.fs.hdfs.impl.disable.cache=true \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--hiveconf hive.server2.thrift.bind.host=`hostname -i` \
--hiveconf hive.server2.thrift.port=10002

命令解析:

  • –name spark_sql_thriftserver2
    指定服务名称为 spark_sql_thriftserver2
    效果如图:
  • –master yarn --deploy-mode client
    指定 spark Job 提交的运行模式为 yarn-client。
    提交至 yarn 的运行模式有两种:yarn-client和yarn-cluster;yarn-client 的模式 driver 运行在本机,yarn-cluster 的模式 driver 提交给 yarn 进行指定,运行在 applicationMaster 所在节点。
    因为 sparkthrift Server 是交互式的任务,需要固定节点和端口,所以只能使用 yarn-client 模式。
  • –driver-cores 4 --driver-memory 8g
    指定主程序运行的环境配置,这里需要考虑本机的资源进行合理配置,因为 sparkthrift Server 是常驻任务,并且提供连接查询数据,所以考虑给大一点,cpu给4个,内存给8g。
  • –executor-cores 4 --executor-memory 6g
    指定启动每个 executor 需要分配的资源,同样需要考虑yarn每个节点的剩余资源进行合理分配;cpu给4个,内存给6g。
  • –conf spark.scheduler.mode=FAIR
    配置 spark 中 job 池的调度模式,该模式有两种:FIFO和FAIR。
    FIFO:每个job没有优先级,不做任务调度,顺序执行。
    FAIR:判断每个job的优先权重后,优先级高的,权重大的先调度执行。
    当然每个池子都可以配置不同的模式(,权重,最小的资源使用量),参考如图:
  • –conf spark.hadoop.fs.hdfs.impl.disable.cache=true
    禁止该任务使用cache缓存。
  • –conf spark.serializer=org.apache.spark.serializer.KryoSerializer
    指定序列化类为Kryo,这个参数非必要,我测试的时候发现默认的序列化类就是这个。
  • –hiveconf hive.server2.thrift.bind.host=hostname -i
    指定提供查询的server2服务的地址:即本机ip即可。(hostname=>本机别称,hostname -i=>本机ip)
  • –hiveconf hive.server2.thrift.port=10002
    指定提供查询的server2服务的端口为:10002。(注意查看一下该端口是否被其他服务占用,lsof -i=>查询所有端口使用情况,lsof -i:端口=>查询某个端口的使用情况)

2、实际生产过程中的报错解决

2.1、Kryo serialization failed: Buffer overflow. Available: 0, required: 2428400. To avoid this, increase spark.kryoserializer.buffer.max value

查询一张300多万条数据的表时(查表直接用的是 select * from tablename),缓冲一段时间后报错:

Caused by: org.apache.spark.SparkException: 
Kryo serialization failed: Buffer overflow. Available: 0, required: 2428400. To avoid this, increase spark.kryoserializer.buffer.max value. at 
	org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:350) at 
	org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:393) 
	...

报错信息提示:序列化的缓冲内存不足,需要扩大缓存内存的容量。
这里应该时 driver 解析 task 发送过来的数据,task 任务会将数据序列化后进行网络传输,driver 接收到数据流后对其反序列化解析数据,才能将数据实际呈现,至于是 写buffer 还是 读buffer 的内存不足这里不做深入。
直接根据其提示,配置相应的参数:

--conf spark.kryoserializer.buffer=64m //这个指定序列化的默认缓冲容量(这个配置可有可无,不影响)
--conf spark.kryoserializer.buffer.max=1024m //指定序列化的最大缓冲容量
2.2、java.lang.OutOfMemoryError: GC overhead limit exceeded

衔接2.1,解决了序列化缓冲的问题,查询同一张表,缓冲了很长一段时间后,任务失败,连同 sparkthrift 任务一同挂掉了。
查询 sparkthrift 任务的执行日志可见主要报错信息:

23/03/14 11:18:29 WARN TransportChannelHandler: Exception in connection from /192.168.1.120:60792
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:3664)
        at java.lang.String.<init>(String.java:207)
        at java.lang.StringBuilder.toString(StringBuilder.java:407)
        at java.lang.Class.getDeclaredMethod(Class.java:2130)
        at java.io.ObjectStreamClass.getInheritableMethod(ObjectStreamClass.java:1611)
        at java.io.ObjectStreamClass.access$2400(ObjectStreamClass.java:79)
        at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:531)
        at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:494)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:494)
        at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:391)
        at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:681)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2001)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1848)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2158)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1665)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2403)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2327)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2185)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1665)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:501)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:459)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
        at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$deserialize$2(NettyRpcEnv.scala:299)
        at org.apache.spark.rpc.netty.NettyRpcEnv$$Lambda$835/115143278.apply(Unknown Source)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
        at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:352)
        at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$deserialize$1(NettyRpcEnv.scala:298)
        at org.apache.spark.rpc.netty.NettyRpcEnv$$Lambda$834/1059022105.apply(Unknown Source)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
        at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:298)

有报错信息可以得知是 GC 超限了。
复现问题的时候,任务执行期间,用 top 命令查询 任务资源消耗情况。发现 sparkthrift 任务内存使用情况比较稳定,稳定维持在20%以下,但是cpu的使用率却高达1200%以上。
同时可见报错信息主要来自于 java.io.ObjectInputStream 这个类,该类应该是 driver 端对数据流进行反序列化解析读取数据时调用的。
个人推测应该是 driver 端解析数据时,内存使用率爆满,但是并没有多余闲置的内存可以释放,所以 cpu 反复 GC 无果,最后报错 GC 超限了。
所以根据推断,我调整了扩大了 driver 端内存的容量:

--driver-cores 4 --driver-memory 12g //这个扩大了driver的内存大小,同时可以考虑调大cpu的数量,毕竟测试的时候cpu使用率都达到1200%以上了
2.3、Job aborted due to stage failure: Total size of serialized results of 7 tasks (1084.0 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)

衔接2.2,GC超限问题解决后,查询同一张表,秒出报错信息:

SQL 错误: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 7 tasks (1084.0 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:361)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:263)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:263)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:258)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:272)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 7 tasks (1084.0 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696)
	at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2965)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:2965)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:334)
	... 16 more

由报错信息可知:某个 task 返回给 driver 的结果集已经超出了默认的大小,那么直接调大结果集的最大容量即可。
一开始我扩大了一倍到2g,但是后来2g又不够了,再给3g,继续不够,得最终直接给到6g算了,反正资源充足。

--conf spark.driver.maxResultSize=6g //如果需要不限制结果集大小的话,直接该参数 =0 即可。

但是该参数应该并不适合无限制放大,其他博主的方案是要减小分区的数量(写spark程序时需要进行优化考虑),以减小最后 driver 端的内存压力。

2.4、java.lang.OutOfMemoryError: Java heap space

当我查询更大的表时(数据量7000W行),最终还是支持不住,报错了:

SQL 错误: org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.OutOfMemoryError: Java heap space
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:361)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:263)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:263)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:258)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:272)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:373)
	at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:369)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:369)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeCollect$1(SparkPlan.scala:391)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeCollect$1$adapted(SparkPlan.scala:390)
	at org.apache.spark.sql.execution.SparkPlan$$Lambda$3241/463491470.apply(Unknown Source)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696)
	at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2965)
	at org.apache.spark.sql.Dataset$$Lambda$2425/999790851.apply(Unknown Source)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
	at org.apache.spark.sql.Dataset$$Lambda$1881/1956032123.apply(Unknown Source)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$$$Lambda$1889/493204045.apply(Unknown Source)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.execution.SQLExecution$$$Lambda$1882/1152849040.apply(Unknown Source)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:2965)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:334)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:263)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3$$Lambda$2021/540239691.apply$mcV$sp(Unknown Source)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)

这里总结一下,Spark 常见的两类 OOM 问题:Driver OOM 和 Executor OOM。
如果是 driver 端的 OOM,可以考虑减少分区的数量和扩大 driver 端的内存容量。
如果发生在 executor,可以通过增加分区数量,减少每个 executor 负载。但是此时,会增加 driver 的负载。所以,可能同时需要增加 driver 内存。定位问题时,一定要先判断是哪里出现了 OOM ,对症下药,才能事半功倍。

3、问题留言

最后,我怀疑,是不是 spark 不适合一次性返回这么大的数据量(select * from tablename 这种方式),毕竟这么大数据量都是要进行网络传输的,无论如何 driver 端的压力都会是巨大的。如果有大神知道,望留言解答,蟹蟹各位!

  • 1
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值