Spark Thrift Server driver crashing on large result sets: spark.sql.thriftServer.incrementalCollect

早点起床晒太阳

Article link: https://blog.csdn.net/zeng6325998/article/details/105584847

References: https://github.com/apache/spark/pull/22219
https://forums.databricks.com/questions/344/how-does-the-jdbc-odbc-thrift-server-stream-query.html

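Background

We use the Spark Thrift Server as the query backend for our BI platform. When several users ran queries on the platform at the same time, the server would sometimes hang and return errors for no obvious reason.
The port was still listening, but the service could not be reached.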

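Solution

Since the problem was the driver dying, we first used jmap to look at the heap. Running jmap -heap pid showed that the old generation of the heap was completely full.
GC could not reclaim that data either, which is what left the server in this state.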

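It turned out that our BI tool has a column-analysis query that pulls all of the data back to the driver and analyzes it there, so with several users working at once the driver ends up holding too much data. We tried
--driver-memory=10G and spark.driver.maxResultSize=10G, but even that was not enough. So, to keep the driver from running out of memory,
we set the following parameter:

spark.sql.thriftServer.incrementalCollect = true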
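As a rough sketch of how these settings could be passed when launching the server, the snippet below assembles a start-thriftserver.sh command line. The /opt/spark install path and the helper name thriftserver_command are assumptions made for illustration, and the 10G values are simply the ones tried above. start-thriftserver.sh accepts the same --driver-memory and --conf options as spark-submit. Note that the property is spelled spark.sql.thriftServer.incrementalCollect, with no trailing "s".

```python
import shlex


def thriftserver_command(spark_home="/opt/spark",      # assumed install path
                         driver_memory="10G",
                         max_result_size="10G",
                         incremental_collect=True):
    """Build the argument list for launching the Spark Thrift Server."""
    conf = {
        "spark.driver.maxResultSize": max_result_size,
        # Stream results partition by partition instead of collect()-ing them all.
        "spark.sql.thriftServer.incrementalCollect": str(incremental_collect).lower(),
    }
    args = [f"{spark_home}/sbin/start-thriftserver.sh",
            "--driver-memory", driver_memory]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print a shell-quoted command line that can be copied into a terminal.
    print(shlex.join(thriftserver_command()))
```

The same key/value pairs could instead go into spark-defaults.conf; passing them on the command line just keeps the example in one place.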

I found a good explanation of what this parameter does in the forum post linked above:

Currently, the JDBC ODBC Thrift Server shares a cluster-global HiveContext (special type of SqlContext).
When you run a query through the Thrift Server, the results are returned in 1 of 2 ways depending on the spark.sql.thriftServer.incrementalCollect flag (true/false):

false: the Driver calls collect() to retrieve all partitions from all Worker nodes. The Driver sends the data back to the client.
This option retrieves from all partitions in parallel and will therefore be faster, overall.
However, this may not be an option as the Driver currently has a limit of 4GB when calling collect(). Calling collect() on results over 4GB will not work.
true: The Driver calls foreachPartition() to incrementally and sequentially retrieve each partition from each Worker. As each partition is retrieved, the Driver sends it back to the client. This option handles result sets > 4GB, but will likely be slower, overall.

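When this value is set to true, the driver retrieves the partitions one at a time, in order, and sends each one to the client. The driver's peak memory therefore drops: instead of holding the entire result set, it holds roughly one partition at a time, so the peak is bounded by the size of the largest partition.

This way there is no need to worry about very large outputs, as long as no single partition is too big for the driver: the partitions are pulled one after another rather than all at once.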
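The difference in peak driver memory can be illustrated with a toy model. This is not Spark's implementation: it counts rows rather than bytes, partitions are plain Python lists, and the function names and the send callback (standing in for "return rows to the JDBC client") are hypothetical.

```python
def peak_rows_full_collect(partitions, send):
    """incrementalCollect=false: buffer every partition, then send."""
    buffered = []
    for part in partitions:
        buffered.extend(part)          # the whole result set sits on the driver
    peak = len(buffered)
    for row in buffered:
        send(row)
    return peak


def peak_rows_incremental(partitions, send):
    """incrementalCollect=true: fetch and send one partition at a time."""
    peak = 0
    for part in partitions:            # sequential, so slower overall
        buffered = list(part)          # only this partition is held
        peak = max(peak, len(buffered))
        for row in buffered:
            send(row)
    return peak


if __name__ == "__main__":
    partitions = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
    sent = []
    print(peak_rows_full_collect(partitions, sent.append))  # 10
    print(peak_rows_incremental(partitions, sent.append))   # 5
```

Both paths deliver the same rows in the same order. Only the number of rows buffered at once differs, which is why a badly skewed partition can still put pressure on the driver.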