1 问题
任务执行缓慢。卡顿日志如下:
2021-07-01 10:19:04 log INFO - Starting Job = job_1625097332819_16902, Tracking URL = http://c2-xxx-hadoopxxx.bj:8088/proxy/application_xxxxxxxx_16902/
2021-07-01 10:19:04 log INFO - Kill Command = /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job -kill job_1625097332819_16902
2021-07-01 10:19:24 log INFO - Hadoop job information for Stage-1: number of mappers: 810; number of reducers: 200
2021-07-01 10:19:25 log INFO - 2021-07-01 10:19:25,490 Stage-1 map = 0%, reduce = 0%
2021-07-01 10:19:48 log INFO - 2021-07-01 10:19:48,356 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 83.01 sec
2021-07-01 10:19:49 log INFO - 2021-07-01 10:19:49,402 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 174.33 sec
2021-07-01 10:19:50 log INFO - 2021-07-01 10:19:50,451 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 221.48 sec
2021-07-01 10:19:51 log INFO - 2021-07-01 10:19:51,488 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 335.56 sec
2021-07-01 10:19:52 log INFO - 2021-07-01 10:19:52,527 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 531.72 sec
2021-07-01 10:19:53 log INFO - 2021-07-01 10:19:53,582 Stage-1 map = 12%, reduce = 0%, Cumulative CPU 809.4 sec
2021-07-01 10:19:54 log INFO - 2021-07-01 10:19:54,651 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 1198.47 sec
2021-07-01 10:19:55 log INFO - 2021-07-01 10:19:55,691 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 1534.1 sec
2021-07-01 10:19:56 log INFO - 2021-07-01 10:19:56,733 Stage-1 map = 26%, reduce = 0%, Cumulative CPU 1762.57 sec
2021-07-01 10:19:57 log INFO - 2021-07-01 10:19:57,777 Stage-1 map = 31%, reduce = 0%, Cumulative CPU 2180.07 sec
2021-07-01 10:19:58 log INFO - 2021-07-01 10:19:58,816 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 2598.88 sec
2021-07-01 10:19:59 log INFO - 2021-07-01 10:19:59,857 Stage-1 map = 40%, reduce = 0%, Cumulative CPU 2949.73 sec
2021-07-01 10:20:00 log INFO - 2021-07-01 10:20:00,905 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 3638.16 sec
2021-07-01 10:20:01 log INFO - 2021-07-01 10:20:01,947 Stage-1 map = 48%, reduce = 0%, Cumulative CPU 4049.69 sec
2021-07-01 10:20:02 log INFO - 2021-07-01 10:20:02,986 Stage-1 map = 51%, reduce = 0%, Cumulative CPU 4272.81 sec
2021-07-01 10:20:04 log INFO - 2021-07-01 10:20:04,035 Stage-1 map = 53%, reduce = 0%, Cumulative CPU 4553.64 sec
2021-07-01 10:20:05 log INFO - 2021-07-01 10:20:05,085 Stage-1 map = 57%, reduce = 0%, Cumulative CPU 4941.65 sec
2021-07-01 10:20:06 log INFO - 2021-07-01 10:20:06,124 Stage-1 map = 61%, reduce = 0%, Cumulative CPU 5443.33 sec
2021-07-01 10:20:07 log INFO - 2021-07-01 10:20:07,169 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 6004.05 sec
2021-07-01 10:20:08 log INFO - 2021-07-01 10:20:08,210 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 6370.51 sec
2021-07-01 10:20:09 log INFO - 2021-07-01 10:20:09,249 Stage-1 map = 65%, reduce = 0%, Cumulative CPU 6869.53 sec
2021-07-01 10:20:10 log INFO - 2021-07-01 10:20:10,288 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 7293.59 sec
2021-07-01 10:20:11 log INFO - 2021-07-01 10:20:11,329 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 7739.03 sec
2021-07-01 10:20:12 log INFO - 2021-07-01 10:20:12,372 Stage-1 map = 71%, reduce = 0%, Cumulative CPU 8230.34 sec
2021-07-01 10:20:13 log INFO - 2021-07-01 10:20:13,415 Stage-1 map = 72%, reduce = 0%, Cumulative CPU 8593.03 sec
2021-07-01 10:20:14 log INFO - 2021-07-01 10:20:14,455 Stage-1 map = 74%, reduce = 0%, Cumulative CPU 8884.0 sec
2021-07-01 10:20:15 log INFO - 2021-07-01 10:20:15,495 Stage-1 map = 76%, reduce = 0%, Cumulative CPU 9217.56 sec
2021-07-01 10:20:17 log INFO - 2021-07-01 10:20:17,576 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 9502.69 sec
2021-07-01 10:20:19 log INFO - 2021-07-01 10:20:19,666 Stage-1 map = 79%, reduce = 0%, Cumulative CPU 9907.57 sec
2021-07-01 10:20:22 log INFO - 2021-07-01 10:20:22,809 Stage-1 map = 80%, reduce = 0%, Cumulative CPU 10226.92 sec
2021-07-01 10:20:27 log INFO - 2021-07-01 10:20:27,016 Stage-1 map = 81%, reduce = 0%, Cumulative CPU 10641.08 sec
2021-07-01 10:20:34 log INFO - 2021-07-01 10:20:34,631 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 11264.2 sec
2021-07-01 10:20:38 log INFO - 2021-07-01 10:20:38,800 Stage-1 map = 83%, reduce = 0%, Cumulative CPU 11534.41 sec
2021-07-01 10:20:46 log INFO - 2021-07-01 10:20:46,099 Stage-1 map = 84%, reduce = 0%, Cumulative CPU 11949.88 sec
2021-07-01 10:20:49 log INFO - 2021-07-01 10:20:49,223 Stage-1 map = 84%, reduce = 4%, Cumulative CPU 12164.8 sec
2021-07-01 10:20:50 log INFO - 2021-07-01 10:20:50,266 Stage-1 map = 84%, reduce = 9%, Cumulative CPU 12301.65 sec
2021-07-01 10:20:51 log INFO - 2021-07-01 10:20:51,306 Stage-1 map = 84%, reduce = 14%, Cumulative CPU 12450.18 sec
2021-07-01 10:20:52 log INFO - 2021-07-01 10:20:52,351 Stage-1 map = 84%, reduce = 18%, Cumulative CPU 12578.82 sec
2021-07-01 10:20:53 log INFO - 2021-07-01 10:20:53,391 Stage-1 map = 85%, reduce = 22%, Cumulative CPU 12710.59 sec
2021-07-01 10:20:54 log INFO - 2021-07-01 10:20:54,435 Stage-1 map = 85%, reduce = 25%, Cumulative CPU 12802.93 sec
2021-07-01 10:20:55 log INFO - 2021-07-01 10:20:55,486 Stage-1 map = 85%, reduce = 27%, Cumulative CPU 12897.56 sec
2021-07-01 10:20:59 log INFO - 2021-07-01 10:20:59,648 Stage-1 map = 85%, reduce = 28%, Cumulative CPU 13117.54 sec
2021-07-01 10:21:04 log INFO - 2021-07-01 10:21:04,854 Stage-1 map = 86%, reduce = 28%, Cumulative CPU 13317.26 sec
2021-07-01 10:21:19 log INFO - 2021-07-01 10:21:19,433 Stage-1 map = 86%, reduce = 29%, Cumulative CPU 13803.17 sec
2021-07-01 10:21:48 log INFO - 2021-07-01 10:21:48,576 Stage-1 map = 87%, reduce = 29%, Cumulative CPU 14941.58 sec
2021-07-01 10:22:48 log INFO - 2021-07-01 10:22:48,970 Stage-1 map = 87%, reduce = 29%, Cumulative CPU 16916.36 sec
2021-07-01 10:23:16 log INFO - 2021-07-01 10:23:16,107 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 17681.74 sec
2021-07-01 10:24:16 log INFO - 2021-07-01 10:24:16,702 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 19030.01 sec
2021-07-01 10:24:39 log INFO - 2021-07-01 10:24:39,607 Stage-1 map = 89%, reduce = 29%, Cumulative CPU 19543.91 sec
2021-07-01 10:25:05 log INFO - 2021-07-01 10:25:05,628 Stage-1 map = 89%, reduce = 30%, Cumulative CPU 20126.07 sec
2021-07-01 10:26:06 log INFO - 2021-07-01 10:26:06,084 Stage-1 map = 89%, reduce = 30%, Cumulative CPU 21018.0 sec
2021-07-01 10:27:06 log INFO - 2021-07-01 10:27:06,474 Stage-1 map = 89%, reduce = 30%, Cumulative CPU 21879.94 sec
2021-07-01 10:28:07 log INFO - 2021-07-01 10:28:07,046 Stage-1 map = 89%, reduce = 30%, Cumulative CPU 22815.09 sec
2021-07-01 10:29:07 log INFO - 2021-07-01 10:29:07,398 Stage-1 map = 89%, reduce = 30%, Cumulative CPU 23958.32 sec
2021-07-01 10:30:07 log INFO - 2021-07-01 10:30:07,695 Stage-1 map = 89%, reduce = 30%, Cumulative CPU 25010.02 sec
2021-07-01 10:30:11 log INFO - 2021-07-01 10:30:11,860 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 25097.77 sec
2021-07-01 10:31:12 log INFO - 2021-07-01 10:31:12,237 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 26114.11 sec
2021-07-01 10:32:12 log INFO - 2021-07-01 10:32:12,518 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 27115.18 sec
2021-07-01 10:33:12 log INFO - 2021-07-01 10:33:12,871 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 28257.23 sec
2021-07-01 10:34:13 log INFO - 2021-07-01 10:34:13,258 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 29457.73 sec
2021-07-01 10:35:13 log INFO - 2021-07-01 10:35:13,700 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 30644.47 sec
2021-07-01 10:36:13 log INFO - 2021-07-01 10:36:13,863 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 31783.64 sec
2021-07-01 10:37:14 log INFO - 2021-07-01 10:37:14,015 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 32519.44 sec
2021-07-01 10:38:14 log INFO - 2021-07-01 10:38:14,147 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 33398.86 sec
2021-07-01 10:39:14 log INFO - 2021-07-01 10:39:14,256 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 34157.51 sec
2021-07-01 10:40:14 log INFO - 2021-07-01 10:40:14,463 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 35916.48 sec
2021-07-01 10:41:14 log INFO - 2021-07-01 10:41:14,682 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 38235.93 sec
2021-07-01 10:42:14 log INFO - 2021-07-01 10:42:14,796 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 40046.69 sec
2021-07-01 10:43:15 log INFO - 2021-07-01 10:43:15,041 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 40877.78 sec
2021-07-01 10:44:15 log INFO - 2021-07-01 10:44:15,070 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 41510.4 sec
2021-07-01 10:45:15 log INFO - 2021-07-01 10:45:15,312 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 42102.11 sec
2021-07-01 10:46:15 log INFO - 2021-07-01 10:46:15,404 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 42606.47 sec
2021-07-01 10:47:15 log INFO - 2021-07-01 10:47:15,506 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 43235.72 sec
2021-07-01 10:48:15 log INFO - 2021-07-01 10:48:15,632 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 44009.25 sec
2021-07-01 10:49:15 log INFO - 2021-07-01 10:49:15,718 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 44663.71 sec
2021-07-01 10:50:15 log INFO - 2021-07-01 10:50:15,798 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 45339.24 sec
2021-07-01 10:51:15 log INFO - 2021-07-01 10:51:15,925 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 46219.58 sec
2021-07-01 10:52:01 log INFO - 2021-07-01 10:52:01,519 Stage-1 map = 91%, reduce = 30%, Cumulative CPU 47109.44 sec
2021-07-01 10:53:01 log INFO - 2021-07-01 10:53:01,753 Stage-1 map = 91%, reduce = 30%, Cumulative CPU 48380.21 sec
2021-07-01 10:54:01 log INFO - 2021-07-01 10:54:01,920 Stage-1 map = 91%, reduce = 30%, Cumulative CPU 49759.68 sec
2021-07-01 10:55:01 log INFO - 2021-07-01 10:55:01,995 Stage-1 map = 91%, reduce = 30%, Cumulative CPU 50736.85 sec
2021-07-01 10:56:02 log INFO - 2021-07-01 10:56:02,004 Stage-1 map = 91%, reduce = 30%, Cumulative CPU 51564.75 sec
2021-07-01 10:56:35 log INFO - 2021-07-01 10:56:35,153 Stage-1 map = 92%, reduce = 30%, Cumulative CPU 52149.04 sec
2021-07-01 10:57:35 log INFO - 2021-07-01 10:57:35,269 Stage-1 map = 92%, reduce = 30%, Cumulative CPU 52934.9 sec
2021-07-01 10:58:35 log INFO - 2021-07-01 10:58:35,426 Stage-1 map = 92%, reduce = 30%, Cumulative CPU 54443.17 sec
2021-07-01 10:59:35 log INFO - 2021-07-01 10:59:35,620 Stage-1 map = 92%, reduce = 30%, Cumulative CPU 56427.32 sec
2021-07-01 11:00:35 log INFO - 2021-07-01 11:00:35,780 Stage-1 map = 92%, reduce = 30%, Cumulative CPU 58277.61 sec
2021-07-01 11:00:36 log INFO - 2021-07-01 11:00:36,816 Stage-1 map = 93%, reduce = 30%, Cumulative CPU 58306.46 sec
2021-07-01 11:01:36 log INFO - 2021-07-01 11:01:36,994 Stage-1 map = 93%, reduce = 30%, Cumulative CPU 59500.57 sec
2021-07-01 11:02:37 log INFO - 2021-07-01 11:02:37,186 Stage-1 map = 93%, reduce = 30%, Cumulative CPU 60221.41 sec
2021-07-01 11:03:37 log INFO - 2021-07-01 11:03:37,307 Stage-1 map = 93%, reduce = 30%, Cumulative CPU 60730.51 sec
2021-07-01 11:04:37 log INFO - 2021-07-01 11:04:37,494 Stage-1 map = 93%, reduce = 30%, Cumulative CPU 61053.13 sec
2021-07-01 11:05:19 log INFO - 2021-07-01 11:05:19,068 Stage-1 map = 94%, reduce = 30%, Cumulative CPU 61357.76 sec
2021-07-01 11:06:19 log INFO - 2021-07-01 11:06:19,283 Stage-1 map = 94%, reduce = 30%, Cumulative CPU 61765.52 sec
2021-07-01 11:07:03 log INFO - 2021-07-01 11:07:03,927 Stage-1 map = 95%, reduce = 30%, Cumulative CPU 62209.08 sec
2021-07-01 11:07:04 log INFO - 2021-07-01 11:07:04,966 Stage-1 map = 95%, reduce = 31%, Cumulative CPU 62214.65 sec
2021-07-01 11:08:05 log INFO - 2021-07-01 11:08:05,107 Stage-1 map = 95%, reduce = 31%, Cumulative CPU 62675.88 sec
2021-07-01 11:08:33 log INFO - 2021-07-01 11:08:33,193 Stage-1 map = 96%, reduce = 31%, Cumulative CPU 62905.03 sec
2021-07-01 11:09:16 log INFO - 2021-07-01 11:09:16,730 Stage-1 map = 96%, reduce = 32%, Cumulative CPU 63276.93 sec
2021-07-01 11:09:21 log INFO - 2021-07-01 11:09:21,906 Stage-1 map = 97%, reduce = 32%, Cumulative CPU 63305.22 sec
2021-07-01 11:10:07 log INFO - 2021-07-01 11:10:07,499 Stage-1 map = 98%, reduce = 32%, Cumulative CPU 63655.26 sec
2021-07-01 11:11:07 log INFO - 2021-07-01 11:11:07,681 Stage-1 map = 98%, reduce = 32%, Cumulative CPU 64020.02 sec
2021-07-01 11:11:13 log INFO - 2021-07-01 11:11:13,916 Stage-1 map = 98%, reduce = 33%, Cumulative CPU 64044.0 sec
2021-07-01 11:11:28 log INFO - 2021-07-01 11:11:28,456 Stage-1 map = 99%, reduce = 33%, Cumulative CPU 64116.01 sec
2021-07-01 11:12:28 log INFO - 2021-07-01 11:12:28,651 Stage-1 map = 99%, reduce = 33%, Cumulative CPU 64356.53 sec
2021-07-01 11:13:28 log INFO - 2021-07-01 11:13:28,772 Stage-1 map = 99%, reduce = 33%, Cumulative CPU 64584.49 sec
2021-07-01 11:14:28 log INFO - 2021-07-01 11:14:28,914 Stage-1 map = 99%, reduce = 33%, Cumulative CPU 64822.4 sec
2021-07-01 11:15:29 log INFO - 2021-07-01 11:15:29,062 Stage-1 map = 99%, reduce = 33%, Cumulative CPU 65080.55 sec
2021-07-01 11:16:29 log INFO - 2021-07-01 11:16:29,172 Stage-1 map = 99%, reduce = 33%, Cumulative CPU 65326.9 sec
2021-07-01 11:17:29 log INFO - 2021-07-01 11:17:29,215 Stage-1 map = 99%, reduce = 33%, Cumulative CPU 65529.33 sec
2 资源管理界面查看详情
点击链接(此处已做处理,打不开哟):http://c2-xxx-hadoopxxx.bj:8088/proxy/application_xxxxxxxx_16902/,打开后如下图所示,依次找到正在执行的reduce tasks。
发现,所有reduce tasks的执行时间都一样,一样的慢。而reduce们都卡在了copy阶段,也就是从map拉取数据到reduce的阶段。
3 思考过程
如下为个人顺序思考过程,有不合适、不正确的地方还请务必留言指导,小编必非常感谢。
-
所有reduce tasks的执行时间都一样,并不是有的很短,有的很长,排除key不均匀导致的数据倾斜问题。
-
卡顿发生在copy阶段,猜测可能是map结果的数据量太大,导致的拉取数据缓慢。
map结果数据量太大,猜测关联键中出现空值或者出现1对多的情况。
回到原始HQL代码,发现结果为用户明细数据。并且,关联键无空值,无1对多的情况,所以数据量虽大,但问题并不出自代码。
select uid,
...
from
(
select uid
...
group by uid
)t1
left join
(
select uid
...
group by uid
)
t2 on t1.uid = t2.uid
...
3.既然是单纯的map结果数据量大,所以reducer从map拉取数据较慢。猜测reducer拉取数据时与未结束的map竞争资源 ,导致死锁。
参考网址:MapReduce任务运行慢问题排查
在mapreduce流程中,reduce启动后需要先copy map的相关shuffle数据,全部搜集完所有map的数据后才能处理。
hadoop为了加快mapreduce job的整体速度,会提前启动一些reduce,先去拉取那些已经完成的map的数据。
但这样会存在隐患,即资源有限时,reduce占用了map的资源,还在等待map的数据,map又在等待资源以运行任务,这就形成了一个死锁。在发生死锁一定时间后,hadoop就会kill掉那个reduce task。
这种情况可以通过调整参数推迟reduce task的执行:
set mapreduce.job.reduce.slowstart.completedmaps = 0.95; – 有95%的map task执行完后,reduce task才能开始copy数据。防止map和reduce竞争资源,导致死锁
mapreduce.job.reduce.slowstart.completedmaps
:默认是0.05,即只要有5%的map task执行完,reduce task就开始copy数据。小编的程序中,因为reduce拉取总数据量达到38亿,所以小编干脆调整到了0.95,即map差不多执行完后,reduce再拉取数据。该程序是离线计算指标,计算时间长还可以承受。