CarbonData connection count optimization

不吃饭的猪

Last modified 2024-06-13 16:13:07


Article tags: big data, hadoop

First published 2024-06-13 11:04:47

Article link: https://blog.csdn.net/weixin_51473488/article/details/139647784


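I. Background

  CarbonData ingestion is served through the CarbonData Thrift Server. Because some segments are loaded abnormally yet still show a status of Success, the script from another blog post is run every day. That script started hitting connection timeouts and failing. Investigation showed that too many connections were being opened each day, because the script traverses every segment on every run.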

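II. Optimization strategies

a. Strategy 1:
1. Add a Spark scheduler pool
In Spark, a scheduler pool groups jobs so that their scheduling priority and share of resources can be controlled. Setting up pools helps manage resource contention between jobs. To use them, you need to enable the fair scheduler and create a pool configuration file.
1-1 Set the scheduler pool
spark.sql.hive.thriftServer.scheduler.pool=my-pool
1-2 Configure the scheduler pool file
cp fairscheduler.xml.template fairscheduler.xml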

<pool name="my-pool">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>3</minShare>
    <maxRunningApps>50</maxRunningApps>
    <maxResources>100g,50</maxResources>
    <minResources>4g,8</minResources>
    <fairSharePreemptionTimeout>300</fairSharePreemptionTimeout>
    <minSharePreemptionTimeout>120</minSharePreemptionTimeout>
    <fairSharePreemptionThreshold>0.5</fairSharePreemptionThreshold>
</pool>
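
Before restarting the Thrift Server, it can help to check what the pool file actually contains. The sketch below is a minimal illustration, not part of the original setup: it parses a pool definition with Python's standard library and flags any property outside the three that the Spark job-scheduling documentation lists for a pool (schedulingMode, weight, minShare). The SAMPLE text, the <allocations> wrapper taken from fairscheduler.xml.template, and the helper names read_pools and undocumented_keys are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Properties the Spark job-scheduling documentation lists for a pool.
SPARK_POOL_KEYS = ("schedulingMode", "weight", "minShare")

# Hypothetical file content: the pool reduced to the documented keys and
# wrapped in the <allocations> root used by fairscheduler.xml.template.
SAMPLE = """<?xml version="1.0"?>
<allocations>
  <pool name="my-pool">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>3</minShare>
  </pool>
</allocations>
"""


def read_pools(xml_text):
    """Return {pool name: {property: value}} for every <pool> element."""
    root = ET.fromstring(xml_text)
    pools = {}
    for pool in root.iter("pool"):
        props = {child.tag: (child.text or "").strip() for child in pool}
        pools[pool.get("name")] = props
    return pools


def undocumented_keys(props):
    """List properties that are not among the documented Spark pool keys."""
    return sorted(key for key in props if key not in SPARK_POOL_KEYS)


if __name__ == "__main__":
    for name, props in read_pools(SAMPLE).items():
        print(name, props, "undocumented:", undocumented_keys(props))
```

To check a real file, read $SPARK_HOME/conf/fairscheduler.xml into a string and pass it to read_pools in place of SAMPLE.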

2. Enable async mode to improve concurrency
spark.sql.hive.thriftServer.async=true
3. Configure spark-defaults.conf

```properties
spark.sql.hive.thriftServer.scheduler.pool=my-pool
spark.sql.hive.thriftServer.thrift.port=10000
spark.sql.hive.thriftServer.idleSessionTimeout=3600
spark.sql.hive.thriftServer.async=true
```

4. Start command
/bin/spark-submit --master yarn \
    --conf spark.driver.maxResultSize=20g \
    --conf spark.sql.hive.thriftServer.scheduler.pool=my-pool \
    --conf spark.scheduler.mode=FAIR \
    --conf spark.scheduler.allocation.file=$SPARK_HOME/conf/fairscheduler.xml \
    --conf spark.sql.shuffle.partitions=50 \
    --driver-memory 25g --executor-cores 4 --executor-memory 5G --num-executors 10 \
    --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
    $SPARK_HOME/carbonlib/apache-carbondata-2.X-bin-sparkx-hadoop2.x.x.jar
The pool is selected by setting spark.sql.hive.thriftServer.scheduler.pool.
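
Because the start command carries several --conf pairs, one option is to keep them in a single mapping and generate the argument list from it, so the pool name and allocation file are changed in one place. This is only a sketch under that assumption: build_submit_args is a hypothetical helper, the values are copied from the command in this post, and the output is meant for display or for passing to a shell that expands $SPARK_HOME.

```python
def build_submit_args(submit, confs, options, main_class, jar):
    """Assemble a spark-submit argument list from plain Python mappings."""
    args = [submit]
    for key, value in confs.items():
        # Each Spark property becomes one "--conf key=value" pair.
        args += ["--conf", f"{key}={value}"]
    for flag, value in options.items():
        # Launcher options such as --master or --driver-memory.
        args += [flag, str(value)]
    # The main class and the application jar come last.
    args += ["--class", main_class, jar]
    return args


if __name__ == "__main__":
    confs = {
        "spark.scheduler.mode": "FAIR",
        "spark.scheduler.allocation.file": "$SPARK_HOME/conf/fairscheduler.xml",
        "spark.sql.hive.thriftServer.scheduler.pool": "my-pool",
    }
    options = {"--master": "yarn", "--driver-memory": "25g", "--num-executors": 10}
    args = build_submit_args(
        "/bin/spark-submit",
        confs,
        options,
        "org.apache.carbondata.spark.thriftserver.CarbonThriftServer",
        "$SPARK_HOME/carbonlib/apache-carbondata-2.X-bin-sparkx-hadoop2.x.x.jar",
    )
    # Joined with spaces for display only; nothing is executed here.
    print(" ".join(args))
```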
5. Verification
    Check the logs for pool creation ("create pool") and "Removed from pool" messages.
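
The log check can be scripted. The sketch below assumes the messages look roughly like "Created pool: <name>" and "Removed ... from pool <name>"; the two regular expressions and the sample lines are illustrative assumptions, not copied from a real run, so adjust them to the exact wording in your driver log.

```python
import re

# Assumed patterns, loosely based on the "create pool" and
# "Removed from pool" messages; tune them to your actual log wording.
CREATED = re.compile(r"created? pool:?\s*(?P<pool>[\w-]+)", re.IGNORECASE)
REMOVED = re.compile(r"removed\b.*\bfrom pool\s+(?P<pool>[\w-]+)", re.IGNORECASE)


def pool_events(lines):
    """Collect pool names seen in creation and removal messages."""
    events = {"created": [], "removed": []}
    for line in lines:
        match = CREATED.search(line)
        if match:
            events["created"].append(match.group("pool"))
        match = REMOVED.search(line)
        if match:
            events["removed"].append(match.group("pool"))
    return events


# Hypothetical log lines, used only to exercise the patterns.
SAMPLE_LOG = [
    "INFO FairSchedulableBuilder: Created pool: my-pool, schedulingMode: FAIR, minShare: 3, weight: 1",
    "INFO TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool my-pool",
    "INFO DAGScheduler: Job 12 finished",
]

if __name__ == "__main__":
    print(pool_events(SAMPLE_LOG))
```

If "my-pool" never appears under "removed", queries are probably not being routed to the pool, which is worth checking against the pool setting used at startup.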
b. Strategy 2:
    Load balancing through ZooKeeper (zk) is another option; this still needs to be tested.
