A Deep Dive into the Spark Web UI

cclovezbf

Article link: https://blog.csdn.net/cclovezbf/article/details/121671103

Background: a table with roughly 120 million rows.

select substring(display_cluster_id,0,1) ,count(1)
from odsiadata.ia_fdw_model_result_for_batch_registration_detect_all
--where display_cluster_id='3_000000337'
group by substring(display_cluster_id,0,1)

The substring call can be ignored; think of the query simply as a GROUP BY on display_cluster_id.

The table is stored as ORC, with a total size of 3.7 G.

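Jobs overview page

The Jobs page is straightforward. The only thing to note is that if you submit several queries one after another in the same session, there will be several entries here. Screenshot to follow.

What really matters is the stage and task views.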

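Stages overview page

Points to note:

1. The map side is implemented with Spark's mapPartitions operator; the reduce side uses the groupBy, mapValues and mapPartitions operators.

2. The first stage is the map stage; the second is the reduce stage.

3. The map stage has 17 tasks; the reduce stage has 120 tasks.

4. Map input = 269.5 MB; map shuffle write = 5.5 KB.

5. Reduce output = 10.4 KB; reduce shuffle read = 5.5 KB.

Note that map shuffle write = reduce shuffle read = 5.5 KB. Put simply, the files the map tasks write out are exactly what the reduce tasks read, so the two numbers are equal.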

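This raises the following questions:

1. Why is there exactly one map stage and one reduce stage? Why not two or three?

3. Why does the map stage have 17 tasks rather than 170? Why does the reduce stage have 120 tasks rather than 121?

4. Why is the map input 269.5 MB rather than 2 G? And why is the map shuffle write so small?

5. Why is the reduce output only 10.4 KB rather than something large, and what is that 10.4 KB?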

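With these questions in mind, let's open the detail page of the map stage.

Note the following:

1. Input Size / Records: 269.5 MB / 122094. This is the same 269.5 MB mentioned above.

2. Shuffle Write: 5.5 KB / 94. This is the same 5.5 KB mentioned above.

3. The 17 shown in the DAG graph matches the 17 tasks we saw.

4. The Executor IDs are 1, 2, 3 and 4, so there are 4 executors; succeeded tasks = 5 + 4 + 4 + 4 = 17.

5. Summing Input Size / Records over the 4 executors gives 269.5 MB / 122094; summing Shuffle Write Size / Records gives 5.5 KB / 94.

Looked at this way, the picture gets clearer and clearer. It feels like we are closing in on the truth.

Moving on: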

HIVE_RECORDS_IN=124969694

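HIVE_DESERIALIZE_ERRORS=0, so there were no deserialization errors.

HIVE_RECORDS_OUT_0: the map stage has no output here.

Let's take executor ID = 1 as an example.

Index = 0 corresponds to task ID 7. There are 17 tasks in total, so Index runs from 0 to 16.

ID = 7: meaning unknown for now...

Executor ID says which executor ran the task; one executor can run 1 to n tasks.

What I have been doing so far is drilling down level by level:

job -> map and reduce stages -> executors of the map stage -> individual tasks of the map stage

Now let's reason back from the tasks to the stage.

The map stage has 124969694 records to read. The data on HDFS is 3.7 G of ORC, and 4 executors share the work.

How should this much data be divided, and into how many pieces? Split it evenly four ways? But some executors work fast and some work slowly. With an even split, if one of them slacks off, how long will the whole job take?

So some rule is needed that divides the work by file size. How exactly? That one has me stumped.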
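A rough sketch of one such rule, offered only as an illustration: cut every file into chunks of at most 128 M, then greedily pack chunks into tasks up to a cap. The cap `MAX_SPLIT`, the helper names and the sample file sizes are all assumptions made up for this example; this is not Hive's or Spark's actual split algorithm. Many small tasks also soften the "slow worker" concern, because Spark hands tasks to whichever executor has a free slot, so a faster executor simply ends up running more of them.

```python
from typing import Dict, List, Tuple

CHUNK = 128 * 1024 * 1024       # chunk size, matching the 128 M seen in this article
MAX_SPLIT = 2 * CHUNK           # hypothetical cap per task (an assumption)

Chunk = Tuple[str, int, int]    # (file name, offset, length)


def cut_into_chunks(files: Dict[str, int]) -> List[Chunk]:
    """Cut every file into consecutive pieces of at most CHUNK bytes."""
    chunks = []
    for name, size in files.items():
        offset = 0
        while offset < size:
            length = min(CHUNK, size - offset)
            chunks.append((name, offset, length))
            offset += length
    return chunks


def pack_into_splits(chunks: List[Chunk]) -> List[List[Chunk]]:
    """Greedily group chunks so that no group exceeds MAX_SPLIT bytes."""
    splits, current, current_bytes = [], [], 0
    for chunk in chunks:
        if current and current_bytes + chunk[2] > MAX_SPLIT:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(chunk)
        current_bytes += chunk[2]
    if current:
        splits.append(current)
    return splits


# Hypothetical file sizes, chosen only to exercise the code.
files = {"a": 300 * 1024 * 1024, "b": 10 * 1024 * 1024}
splits = pack_into_splits(cut_into_chunks(files))
for i, split in enumerate(splits):
    print(i, sum(length for _, _, length in split), split)
```

With these made-up sizes, the first task gets two full chunks of file "a" and the second gets the tail of "a" together with the small file "b".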

Check the logs to see which files were read:

21/12/02 09:51:32 INFO rdd.HadoopRDD: Input split: Paths:/user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__4a3602db_1259_40f5_9348_e53a7f0ca8fd:3221225472+158466895,/user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__67521883_60a9_4551_ba1c_f7479faf6ed7:0+12058631InputFormatClass: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

21/12/02 09:51:34 INFO rdd.HadoopRDD: Input split: Paths:/user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__4a3602db_1259_40f5_9348_e53a7f0ca8fd:2415919104+134217728,/user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__4a3602db_1259_40f5_9348_e53a7f0ca8fd:2550136832+134217728InputFormatClass: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

21/12/02 09:51:34 INFO rdd.HadoopRDD: Input split: Paths:/user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__173785be_79fd_408f_be3f_d6540399a070:0+134217728,/user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__173785be_79fd_408f_be3f_d6540399a070:134217728+134217728InputFormatClass: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

21/12/02 09:51:34 INFO rdd.HadoopRDD: Input split: Paths:/user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__c1e47e6e_8765_4c0b_9c26_73e434799347:0+175085967InputFormatClass: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

21/12/02 09:51:34 INFO rdd.HadoopRDD: Input split: Paths:/user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__4a3602db_1259_40f5_9348_e53a7f0ca8fd:268435456+134217728,/user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__4a3602db_1259_40f5_9348_e53a7f0ca8fd:402653184+134217728InputFormatClass: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

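Clicking through to the executor log shows how the input was divided: every 128 M forms one block.

A file somewhat larger than 128 M, such as the 167.0 M one, is also treated as a single block.

Executor 1 was given 9 blocks, so why does it have only 5 tasks?

The reason is that some tasks read two blocks each; note that the log above contains 5 entries in total.

Now look at our files: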

307.2 M  921.6 M  /user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__173785be_79fd_408f_be3f_d6540399a070
1.4 K    4.2 K    /user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__43d2a20a_b9e2_4182_8d0c_c8a9e0491f69
3.1 G    9.4 G    /user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__4a3602db_1259_40f5_9348_e53a7f0ca8fd
11.5 M   34.5 M   /user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__67521883_60a9_4551_ba1c_f7479faf6ed7
95.6 M   286.9 M  /user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__7652d62b_4a74_4e95_b220_5f5ef9c8e50d
167.0 M  500.9 M  /user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__c1e47e6e_8765_4c0b_9c26_73e434799347
7.7 M    23.0 M   /user/hive/warehouse/odsiadata.db/ia_fdw_model_result_for_batch_registration_detect_all/ia_fdw_model_result_for_batch_registration_detect_all__cd8a8198_5fe3_4fc5_b42d_7cdec99155d7

3.1 G / 128 M = 24.8 blocks

307.2 M / 128 M = 2.4 blocks

95.6 M = 1 block

167 M = 1 block

That comes to roughly 28 to 30 blocks.
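The same estimate can be redone for all seven files at once. The snippet below is a small sketch that only restates the arithmetic: the sizes come from the first column of the listing above, the file names are shortened to the first part of their UUID suffix, and treating "G" as 1024 M is an assumption about the units of the listing.

```python
CHUNK_MB = 128.0

# Sizes in M, copied from the first column of the listing above
# (3.1 G is taken as 3.1 * 1024 M, 1.4 K as 1.4 / 1024 M).
file_sizes_mb = {
    "173785be": 307.2,
    "43d2a20a": 1.4 / 1024,
    "4a3602db": 3.1 * 1024,
    "67521883": 11.5,
    "7652d62b": 95.6,
    "c1e47e6e": 167.0,
    "cd8a8198": 7.7,
}

total_mb = sum(file_sizes_mb.values())
blocks = {name: size / CHUNK_MB for name, size in file_sizes_mb.items()}
total_blocks = sum(blocks.values())

for name, n in blocks.items():
    print(f"{name}: {n:.2f} blocks")
print(f"total: {total_mb / 1024:.2f} G, {total_blocks:.1f} blocks of 128 M")
```

The total comes out at about 3.7 G and a little over 29 blocks of 128 M, in line with the table size quoted at the start and with the rough 28 to 30 estimate.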

task1 reads 4a3602db_1259_40f5_9348_e53a7f0ca8fd:3221225472+158466895, i.e. offset 3 G, length 151 MB

and 67521883_60a9_4551_ba1c_f7479faf6ed7:0+12058631, i.e. 11.5 M

task2 reads 4a3602db_1259_40f5_9348_e53a7f0ca8fd:2415919104+134217728, the 19th 128 M block

and 4a3602db_1259_40f5_9348_e53a7f0ca8fd:2550136832+134217728, the 20th 128 M block

task3 reads 173785be_79fd_408f_be3f_d6540399a070:0+134217728, the 1st 128 M block

and 173785be_79fd_408f_be3f_d6540399a070:134217728+134217728, the 2nd 128 M block

task4 reads 4a3602db_1259_40f5_9348_e53a7f0ca8fd:268435456+134217728, the 3rd 128 M block

and 4a3602db_1259_40f5_9348_e53a7f0ca8fd:402653184+134217728, the 4th 128 M block

task5 reads c1e47e6e_8765_4c0b_9c26_73e434799347:0+175085967, a single block of 167 M
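The block numbers above follow directly from the byte offsets in the log. A minimal sketch, assuming blocks are counted from 1 and are exactly 134217728 bytes (128 M) long; `block_number` is a helper name invented for this example:

```python
CHUNK = 128 * 1024 * 1024  # 134217728 bytes, the chunk length seen in the logs


def block_number(offset: int) -> int:
    """1-based index of the 128 M chunk that starts at this byte offset."""
    return offset // CHUNK + 1


# Offsets copied from the "Input split" log lines above.
offsets = [2415919104, 2550136832, 268435456, 402653184, 3221225472]
numbers = [block_number(o) for o in offsets]
print(numbers)  # [19, 20, 3, 4, 25]
```

The last offset, 3221225472, is exactly 3 G, so the 151 MB tail that task1 reads is the 25th piece of the 3.1 G file.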

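So:

task1: 162.6 M

task2: 256 M

task3: 256 M

task4: 256 M

task5: 167 M

Every task reads more than 128 M, while still following the 128 M block boundaries as far as possible. I'll look into the splitting rule itself when I have time.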
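The per-task sizes can also be recomputed mechanically from the log lines rather than by hand. The sketch below keeps only the `offset+length` parts of the five "Input split" entries, in log order, with file names shortened to the first part of their UUID suffix; the regular expression and the helper name `split_bytes` are choices made for this example.

```python
import re

# Offsets and lengths copied from the "Input split" log lines above;
# file names are shortened to the first part of their UUID suffix.
log_splits = [
    "4a3602db:3221225472+158466895,67521883:0+12058631",
    "4a3602db:2415919104+134217728,4a3602db:2550136832+134217728",
    "173785be:0+134217728,173785be:134217728+134217728",
    "c1e47e6e:0+175085967",
    "4a3602db:268435456+134217728,4a3602db:402653184+134217728",
]

PART = re.compile(r"(?P<file>[^:,]+):(?P<offset>\d+)\+(?P<length>\d+)")
MB = 1024 * 1024


def split_bytes(line: str) -> int:
    """Sum the lengths of all file ranges listed in one input split."""
    return sum(int(m.group("length")) for m in PART.finditer(line))


range_count = sum(len(PART.findall(line)) for line in log_splits)
sizes_mb = [split_bytes(line) / MB for line in log_splits]

print(f"{range_count} ranges in {len(log_splits)} splits")
for i, size in enumerate(sizes_mb, start=1):
    print(f"split {i}: {size:.1f} M")
```

The 9 ranges spread over 5 splits are the "9 blocks, 5 tasks" seen for executor 1.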

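According to the screenshot above, executor 1 processed more than 17 million records. But then what does Input Size / Records mean?

Looking at a single task, it processed 4.81 million records.

Its Shuffle Write = 360.0 B / 6 makes sense: the grouping column ends up with only 8 distinct values, so a task can output at most 8 records, and 6 is within that bound. Its output must look something like 1: 1,000,000; 2: 200,000; 3: 300,000; and so on.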
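Why each map task writes so few shuffle records can be sketched as follows, assuming the counts are pre-aggregated per key inside each map task before the shuffle (map-side partial aggregation). The sample rows are made up; only the shape of the IDs follows the query. Under that assumption, 17 map tasks and 8 groups give at most 17 × 8 = 136 shuffle records, which is consistent with the 94 reported for the stage.

```python
from collections import Counter
from typing import Iterable, List


def map_task(rows: Iterable[str]) -> Counter:
    """Count rows per first character of the key (partial aggregation)."""
    partial = Counter()
    for display_cluster_id in rows:
        partial[display_cluster_id[:1]] += 1
    return partial


def reduce_side(partials: List[Counter]) -> Counter:
    """Merge the partial counts coming from all map tasks."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Made-up sample rows, one inner list per map task.
task_inputs = [
    ["3_000000337", "3_000000338", "1_000000001", "2_000000009"],
    ["1_000000002", "3_000000337", "3_000000400"],
]
partials = [map_task(rows) for rows in task_inputs]
shuffle_records = sum(len(p) for p in partials)  # one record per (task, key)
result = reduce_side(partials)
print(shuffle_records, dict(result))
```

However many rows a task reads, the number of records it shuffles is bounded by the number of distinct keys it sees, not by its input size.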

Input Size / Records = 12.3 MB / 4707 is harder to make sense of: why would only 4707 records have been processed?
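One possible reading, offered here purely as an unverified assumption: the "Records" counter may be counting batches of rows rather than individual rows. Hive's vectorized ORC reader works on row batches of 1024 rows by default, and the ratio between the two stage-level numbers quoted earlier lands very close to that. The check below is only arithmetic on figures already shown in this article.

```python
rows_read = 124_969_694      # HIVE_RECORDS_IN for the map stage
input_records = 122_094      # "Records" in Input Size / Records for the stage

rows_per_record = rows_read / input_records
print(f"rows per input record: {rows_per_record:.1f}")

# The same ratio applied to the single task that shows 4707 input records.
estimated_task_rows = 4707 * rows_per_record
print(f"estimated rows for the task: {estimated_task_rows:,.0f}")
```

The ratio is a little under 1024, and scaling the task's 4707 records by it gives about 4.8 million rows, close to the 4.81 million rows that task processed. This fits the batch hypothesis but does not prove it.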

My own guess is that although there are more than 4.8 million rows, when the shuffle is executed

-- To be continued
