Spark Resource Optimization
When submitting a Spark application, resource tuning is a must; otherwise the job runs with the default of 2 executors and 1G of memory per executor. This post is a summary focused on resource utilization.
The options mainly involved:

--num-executors
--executor-memory
--executor-cores
--conf spark.default.parallelism
YARN resources available on the server nodes

Servers | Cores per node | Total cores | Memory per node | Total memory |
3 | 29 | 87 | 18G | 54G |
A small issue hit while submitting jobs

When submitting Spark jobs, the executor count stayed at 2 no matter what was configured. It turned out that CDH had yarn.scheduler.minimum-allocation-mb set to 8G, i.e. YARN's minimum scheduling unit per container. With 18G of YARN memory available on a node, at most 2 such containers, i.e. 2 executors, could be started; the remaining memory was still enough to start one AM container, so there were 3 containers in the end but only 2 executors.
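The container arithmetic behind this can be sketched as follows, assuming the values above (8G minimum allocation, 18G of YARN memory per node); the function name is ours, for illustration only:

```python
def max_containers_per_node(node_mem_mb: int, min_alloc_mb: int) -> int:
    """Every container request is bumped up to at least min_alloc_mb,
    so one node can host at most node_mem_mb // min_alloc_mb containers."""
    return node_mem_mb // min_alloc_mb

# 18G of YARN memory per node, 8G minimum allocation -> only 2 containers fit
print(max_containers_per_node(18 * 1024, 8 * 1024))  # → 2
```

Lowering yarn.scheduler.minimum-allocation-mb (e.g. to 1G) is what makes finer-grained executor sizing possible in the rest of this post.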
Points to note

1. An executor's resource request must include overhead (off-heap) memory on top of the heap.

spark.executor.memoryOverhead = max(384, executorMemory * 0.07)
# for executor memory below about 5g this works out to 384m
spark.executor.memory defaults to 1g

With the defaults, the real request would be 1g + 384m, but since the allocation increment is set to 512m, the final request is rounded up to 1.5g.
The driver's memory works the same way.
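The request-size arithmetic above can be sketched as follows (a sketch assuming the 512m allocation increment mentioned above, which on CDH's Fair Scheduler is yarn.scheduler.increment-allocation-mb):

```python
import math

def container_request_mb(executor_mem_mb: int, increment_mb: int = 512) -> int:
    """Executor memory plus memoryOverhead, rounded up to the scheduler increment."""
    overhead = max(384, int(executor_mem_mb * 0.07))
    raw = executor_mem_mb + overhead
    return math.ceil(raw / increment_mb) * increment_mb

print(container_request_mb(1024))  # default 1g executor → 1536 (1.5g)
print(container_request_mb(2048))  # 2g executor → 2560 (2.5g), used below
```

The 2g case is why the sizing example later in this post budgets 2.5G of YARN memory per executor.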
2. Each container runs exactly one executor.
3. The driver occupies one container, with 1 core and 1G of memory by default.
That is, one node starts an extra container to run the driver; in cluster mode the AM and the driver share this container, taking 1 core and 1G by default. Since that node's remaining resources shrink, subtract one executor from the plan.
4. Set parallelism to 2-3x the total number of executor cores.
Example

Memory is the tighter resource here, so start the allocation from memory.
Each node has 18G usable; with 2G per executor the real request is 2.5G, so a node fits 18 / 2.5 = 7.2, i.e. 7 executors, 21 in total.
With 87 cores overall, each executor gets 87 / 21 = 4.14, i.e. 4 cores.
The AM needs one container of its own, so drop one executor, leaving 20.
Set parallelism to 2-3x the total executor cores, i.e. 2x 80 = 160.
--num-executors 20 \
--executor-memory 2g \
--executor-cores 4 \
--conf spark.default.parallelism=160 \
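The sizing steps above can be reproduced end to end (a sketch; the numbers match this cluster's 3 nodes x 18G x 29 cores):

```python
nodes, node_mem_gb, total_cores = 3, 18, 87
container_gb = 2.5                           # 2g executor + overhead, rounded up

per_node = int(node_mem_gb / container_gb)   # 7.2 → 7 executors per node
total_execs = nodes * per_node               # 21 candidate executors
cores_per_exec = total_cores // total_execs  # 4.14 → 4 cores each
total_execs -= 1                             # leave one container for the AM
parallelism = total_execs * cores_per_exec * 2  # 2x total executor cores

print(total_execs, cores_per_exec, parallelism)  # → 20 4 160
```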
Test it with spark-shell
spark-shell \
--master yarn \
--deploy-mode client \
--num-executors 20 \
--executor-cores 4 \
--executor-memory 2g
Containers: 21, i.e. 20 executors plus 1 driver.
Cores: 81, i.e. 20 executors x 4 = 80, plus the driver's default 1 core.
Memory: 52224m, i.e. 20 executors x 2.5 x 1024 = 51200, plus the driver's default 1G (1024m).
That puts memory utilization at 94.4% (52224 / 55296) and CPU at about 93% (81 / 87). This can still be tuned further depending on the situation; try not to leave resources sitting idle.
For example, in cluster mode you can raise the driver resources, which are then allocated to the AM's container.
--master yarn \
--deploy-mode cluster \
--driver-memory 2g \
--driver-cores 4
That gets close to full utilization.
Alternatively, enable off-heap memory, which needs spare memory allocated separately. Executors share off-heap memory, while the tasks inside an executor share that executor's on-heap memory.
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=3072m \
The Storage Memory shown in the UI is storage memory; per the official docs:

spark.memory.fraction defaults to 0.6
# Fraction of (heap space - 300MB) used for execution and storage.
spark.memory.storageFraction defaults to 0.5
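With those defaults, the unified memory region of a 2g executor can be estimated as follows (a sketch of the unified memory model; the exact heap size the JVM reports varies slightly, and what the UI labels "Storage Memory" differs across Spark versions):

```python
def unified_memory_mb(heap_mb: float, fraction: float = 0.6,
                      storage_fraction: float = 0.5):
    usable = heap_mb - 300                # 300MB is reserved memory
    unified = usable * fraction           # shared by execution and storage
    storage = unified * storage_fraction  # storage's protected half
    return unified, storage

unified, storage = unified_memory_mb(2048)
print(round(unified, 1), round(storage, 1))  # → 1048.8 524.4
```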
For example, with the YARN node memory raised to 22G (66G total, 87 cores):
spark-shell \
--master yarn \
--deploy-mode client \
--num-executors 21 \
--executor-cores 4 \
--executor-memory 2560m
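The same overhead-plus-rounding arithmetic explains the 21 executors here (a sketch: 2560m + 384m overhead = 2944m, rounded up to the 512m increment gives 3G per container):

```python
import math

mem = 2560 + max(384, int(2560 * 0.07))   # 2560 * 0.07 = 179 < 384 → 2944m
container = math.ceil(mem / 512) * 512    # rounded up → 3072m (3G)
per_node = (22 * 1024) // container       # 22528 // 3072 = 7 per node
print(container, per_node, 3 * per_node)  # → 3072 7 21
```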