The MapReduce Execution Flow, and How Hive's insert overwrite Runs Data Under the Hood


Contents

MR Overview

MR Execution Flow

Input Phase

Mapper Phase

Reducer Phase

Example

insert overwrite table

Log Walkthrough

 


MR Overview

A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The MR framework sorts the map outputs, which are then fed into the reduce tasks. Typically both the input and the output of a job are stored in the file system. The framework takes care of scheduling tasks, monitoring them, and re-executing any that fail.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes. This configuration allows the framework to schedule tasks efficiently on the nodes where the data already lives, yielding high aggregate bandwidth across the cluster.
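
To make the pipeline concrete, here is a minimal sketch of submitting such a job through the Hadoop Java API. It is a hypothetical WordCount example, not code from this article; the WordCountMapper and WordCountReducer classes are sketched under the Mapper and Reducer phases below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // sketched in the Mapper phase below
        job.setReducerClass(WordCountReducer.class);  // sketched in the Reducer phase below
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input is split into chunks
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output lands in the file system
        // The framework now schedules the tasks, monitors them, and retries failures.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}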

MR Execution Flow

Input Phase

  • The JobClient locates where the input files are stored
  • The JobClient specifies the split logic through the InputFormat interface; by default, files are split along HDFS blocks, i.e. there are as many maps as there are data blocks (see the sketch after this list)
  • Hadoop then further parses each split into <key, value> records
  • The JobTracker assigns each data block to a mapper, while a RecordReader reads in the key/value pairs
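
As a small illustration of the split logic, the sketch below configures the default TextInputFormat on a hypothetical input path. By default every HDFS block becomes one split, hence one map task; capping the split size is one way to get more, smaller maps.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputSetup {
    static void configure(Job job) throws Exception {
        job.setInputFormatClass(TextInputFormat.class);          // split logic; default follows HDFS blocks
        FileInputFormat.addInputPath(job, new Path("/data/in")); // hypothetical input directory
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // optional: cap the split size
        // TextInputFormat's RecordReader then turns each split into
        // <byte offset (LongWritable), line (Text)> key/value pairs.
    }
}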

Mapper Phase

  • The JobTracker dispatches the map tasks to TaskTrackers for execution
  • After a mapper receives its data, the output is collected in Hadoop's in-memory buffer (100 MB by default), where it is partitioned and sorted by key; once the buffer fills up, the contents are spilled to disk
  • The partition operation groups the data by key, so that records with the same key land in the same partition
  • Within a single partition the records are sorted by key (the sort operation)
  • Combiner phase (optional): on the map side, key/value pairs with the same key are combined before the shuffle, easing the load on the reducers
  • The map task writes its output to the file system (a mapper sketch follows this list)
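
A minimal mapper for the hypothetical WordCount job follows; note that the buffering, partitioning, sorting and spilling described above are done by the framework, not by this code.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            ctx.write(word, ONE); // buffered in memory, partitioned and sorted, then spilled
        }
    }
}

// Registering the optional combiner (for sums the reducer class itself works):
//   job.setCombinerClass(WordCountReducer.class);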

Reducer Phase

The Reducer has three main phases: shuffle, sort (which includes grouping), and reduce

  • Shuffle: the reducer fetches the relevant partition of every mapper's output over HTTP
  • Grouping: the reduce side groups together same-key data coming from different maps, since different mappers may emit the same key
  • Sort: during the reduce stage the framework sorts the map outputs by key; this step is called sort

Note: the shuffle and sort phases occur simultaneously; map outputs are merged while they are being fetched.

  • Secondary Sort: if Job.setSortComparatorClass(Class) is used, the intermediate keys are sorted with a custom comparator. Since Job.setGroupingComparatorClass(Class) can be used to control how the intermediate keys are grouped, the two can be combined to simulate a secondary sort on the values
  • Reduce: the reduce(WritableComparable, Iterable<Writable>, Context) method is called for each <key, (list of values)> pair, and the results are finally written to the file system

The number of reduces can be specified via Job.setNumReduceTasks(int). // the count must not be smaller than the number of partitions the Partitioner produces, or the job fails (see the sketch below)
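
The sketch below completes the hypothetical WordCount job with a reducer, a toy Partitioner, and the Job settings just mentioned. The clamp in getPartition illustrates the constraint above: returned partition indices must stay within [0, numReduceTasks).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get(); // iterate over <key, (list of values)>
        ctx.write(key, new IntWritable(sum));        // finally written to the file system
    }

    // Hypothetical partitioner: route keys into 26 buckets by first letter.
    public static class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            int bucket = Character.toLowerCase(key.toString().charAt(0)) - 'a';
            return Math.max(0, Math.min(bucket, numReduceTasks - 1)); // keep within [0, numReduceTasks)
        }
    }

    static void wire(Job job) {
        job.setNumReduceTasks(26);                             // must cover every partition index
        job.setPartitionerClass(FirstLetterPartitioner.class);
        // For a secondary sort, additionally set:
        //   job.setSortComparatorClass(...);
        //   job.setGroupingComparatorClass(...);
    }
}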

Example

Under the hood, a Hive task runs as MapReduce jobs.

insert overwrite table

0: jdbc:hive2://hiveserver2.bigdata.chinatele> insert OVERWRITE table jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d PARTITION (data_day)
. . . . . . . . . . . . . . . . . . . . . . .> select * from sc_share_db.app_mbl_user_trmnl_trail_info_d a where a.data_day BETWEEN 20190619 AND 20191007;
INFO  : Compiling command(queryId=hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45): insert OVERWRITE table jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d PARTITION (data_day)
select * from sc_share_db.app_mbl_user_trmnl_trail_info_d a where a.data_day BETWEEN 20190619 AND 20191007
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a.mdn, type:string, comment:null), FieldSchema(name:a.r_trmnl_brand, type:string, comment:null), FieldSchema(name:a.r_trmnl_model, type:string, comment:null), FieldSchema(name:a.r_use_day, type:string, comment:null), FieldSchema(name:a.d_trmnl_brand, type:string, comment:null), FieldSchema(name:a.d_trmnl_model, type:string, comment:null), FieldSchema(name:a.d_use_day, type:string, comment:null), FieldSchema(name:a.data_day, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45); Time taken: 0.412 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45): insert OVERWRITE table jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d PARTITION (data_day)
select * from sc_share_db.app_mbl_user_trmnl_trail_info_d a where a.data_day BETWEEN 20190619 AND 20191007
INFO  : Query ID = hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45
INFO  : Total jobs = 3
INFO  : Launching Job 1 out of 3
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
INFO  : number of splits:237
INFO  : Submitting tokens for job: job_1569295562481_2677748
INFO  : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ns4, Ident: (token for jt_jtsjzxsjyyc_sc_fwfz: HDFS_DELEGATION_TOKEN owner=jt_jtsjzxsjyyc_sc_fwfz, renewer=yarn, realUser=hive/hiveserver2.bigdata.chinatelecom.cn@HADOOP.CHINATELECOM.CN, issueDate=1570678407251, maxDate=1571283207251, sequenceNumber=99889585, masterKeyId=889)
INFO  : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ns3, Ident: (token for jt_jtsjzxsjyyc_sc_fwfz: HDFS_DELEGATION_TOKEN owner=jt_jtsjzxsjyyc_sc_fwfz, renewer=yarn, realUser=hive/hiveserver2.bigdata.chinatelecom.cn@HADOOP.CHINATELECOM.CN, issueDate=1570678407264, maxDate=1571283207264, sequenceNumber=100362646, masterKeyId=873)
INFO  : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ns, Ident: (token for jt_jtsjzxsjyyc_sc_fwfz: HDFS_DELEGATION_TOKEN owner=jt_jtsjzxsjyyc_sc_fwfz, renewer=yarn, realUser=hive/hiveserver2.bigdata.chinatelecom.cn@HADOOP.CHINATELECOM.CN, issueDate=1570678406757, maxDate=1571283206757, sequenceNumber=381027444, masterKeyId=1165)
INFO  : Kind: HIVE_DELEGATION_TOKEN, Service: HiveServer2ImpersonationToken, Ident: 00 16 6a 74 5f 6a 74 73 6a 7a 78 73 6a 79 79 63 5f 73 63 5f 66 77 66 7a 16 6a 74 5f 6a 74 73 6a 7a 78 73 6a 79 79 63 5f 73 63 5f 66 77 66 7a 3f 68 69 76 65 2f 68 69 76 65 73 65 72 76 65 72 32 2e 62 69 67 64 61 74 61 2e 63 68 69 6e 61 74 65 6c 65 63 6f 6d 2e 63 6e 40 48 41 44 4f 4f 50 2e 43 48 49 4e 41 54 45 4c 45 43 4f 4d 2e 43 4e 8a 01 6d b3 a7 a2 5e 8a 01 6d d7 b4 26 5e 8e 77 aa 8e 19 30
INFO  : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ns2, Ident: (token for jt_jtsjzxsjyyc_sc_fwfz: HDFS_DELEGATION_TOKEN owner=jt_jtsjzxsjyyc_sc_fwfz, renewer=yarn, realUser=hive/hiveserver2.bigdata.chinatelecom.cn@HADOOP.CHINATELECOM.CN, issueDate=1570678407250, maxDate=1571283207250, sequenceNumber=110691977, masterKeyId=871)
INFO  : The url to track the job: http://NM-304-RH5885V3-BIGDATA-008:8088/proxy/application_1569295562481_2677748/
INFO  : Starting Job = job_1569295562481_2677748, Tracking URL = http://NM-304-RH5885V3-BIGDATA-008:8088/proxy/application_1569295562481_2677748/
INFO  : Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1569295562481_2677748
INFO  : Hadoop job information for Stage-1: number of mappers: 237; number of reducers: 0
INFO  : 2019-10-10 11:35:34,427 Stage-1 map = 0%,  reduce = 0%
INFO  : 2019-10-10 11:36:18,032 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 10.65 sec
INFO  : 2019-10-10 11:36:19,080 Stage-1 map = 8%,  reduce = 0%, Cumulative CPU 96.18 sec
INFO  : 2019-10-10 11:36:20,131 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU 224.8 sec
INFO  : 2019-10-10 11:36:21,181 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 342.52 sec
INFO  : 2019-10-10 11:36:22,243 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 444.37 sec
INFO  : 2019-10-10 11:36:23,304 Stage-1 map = 47%,  reduce = 0%, Cumulative CPU 1014.68 sec
INFO  : 2019-10-10 11:36:24,582 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 1638.8 sec
================================ (portion omitted)
INFO  : 2019-10-10 11:36:37,280 Stage-1 map = 84%,  reduce = 0%, Cumulative CPU 2181.11 sec
INFO  : 2019-10-10 11:36:48,674 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 2209.67 sec
INFO  : 2019-10-10 11:37:01,292 Stage-1 map = 90%,  reduce = 0%, Cumulative CPU 2299.86 sec
INFO  : 2019-10-10 11:37:07,494 Stage-1 map = 92%,  reduce = 0%, Cumulative CPU 2322.2 sec
INFO  : 2019-10-10 11:37:12,849 Stage-1 map = 93%,  reduce = 0%, Cumulative CPU 2335.47 sec
INFO  : 2019-10-10 11:37:13,886 Stage-1 map = 97%,  reduce = 0%, Cumulative CPU 2363.0 sec
INFO  : 2019-10-10 11:37:14,922 Stage-1 map = 98%,  reduce = 0%, Cumulative CPU 2372.74 sec
INFO  : 2019-10-10 11:39:16,852 Stage-1 map = 99%,  reduce = 0%, Cumulative CPU 2386.92 sec

INFO  : 2019-10-10 11:46:52,457 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2398.68 sec
INFO  : MapReduce Total cumulative CPU time: 39 minutes 58 seconds 680 msec
INFO  : Ended Job = job_1569295562481_2677748
INFO  : Starting task [Stage-7:CONDITIONAL] in serial mode
INFO  : Stage-4 is selected by condition resolver.
INFO  : Stage-3 is filtered out by condition resolver.
INFO  : Stage-5 is filtered out by condition resolver.
INFO  : Starting task [Stage-4:MOVE] in serial mode
INFO  : Moving data to: viewfs://ctccfs/user/hive_tmp/.hive-staging_hive_2019-10-10_11-32-41_413_3936369249909689263-4162/-ext-10000 from viewfs://ctccfs/user/hive_tmp/.hive-staging_hive_2019-10-10_11-32-41_413_3936369249909689263-4162/-ext-10002
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d partition (data_day=null) from viewfs://ctccfs/user/hive_tmp/.hive-staging_hive_2019-10-10_11-32-41_413_3936369249909689263-4162/-ext-10000
INFO  :          Time taken for load dynamic partitions : 37085
INFO  :         Loading partition {data_day=20190915}
INFO  :         Loading partition {data_day=20190624}
INFO  :         Loading partition {data_day=20190906}
INFO  :         Loading partition {data_day=20190902}
INFO  :         Loading partition {data_day=20190909}
================================ (portion omitted)
INFO  :         Loading partition {data_day=20190901}
INFO  :         Loading partition {data_day=20190916}
INFO  :         Loading partition {data_day=20190908}
INFO  :         Loading partition {data_day=20190723}
INFO  :          Time taken for adding to write entity : 12
INFO  : Starting task [Stage-2:STATS] in serial mode
INFO  : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191001} stats: [numFiles=1, numRows=106817, totalSize=5047510, rawDataSize=4940693]
INFO  : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191002} stats: [numFiles=1, numRows=142186, totalSize=7349564, rawDataSize=7207378]
INFO  : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191003} stats: [numFiles=1, numRows=146760, totalSize=7585261, rawDataSize=7438501]
================================ (portion omitted)
INFO  : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191004} stats: [numFiles=1, numRows=115010, totalSize=5880787, rawDataSize=5765777]
INFO  : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191005} stats: [numFiles=1, numRows=128308, totalSize=6711669, rawDataSize=6583361]
INFO  : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191006} stats: [numFiles=1, numRows=104644, totalSize=5418150, rawDataSize=5313506]
INFO  : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191007} stats: [numFiles=1, numRows=89627, totalSize=4577004, rawDataSize=4487377]
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 237   Cumulative CPU: 2398.68 sec   HDFS Read: 21622480200 HDFS Write: 459476088 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 39 minutes 58 seconds 680 msec
INFO  : Completed executing command(queryId=hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45); Time taken: 1008.946 seconds
INFO  : OK
No rows affected (1009.374 seconds)

Log Walkthrough

  1. Start compiling the insert command; a queryId is generated
  2. Parse the HiveQL; semantic analysis completes
  3. Return the Hive schema: the FieldSchema list (field name, field type, field comment) plus properties
  4. Compilation completes; the queryId (the same as in step 1) and the compile time (in seconds) are reported
  5. Info: concurrency mode is disabled, so no lock manager is created
  6. Execute the insert command
  7. Info: the queryId
  8. 3 jobs in total
  9. Launch job 1
  10. Start task [Stage-1:MAPRED] in serial mode
  11. The number of reduce tasks is set to 0 because there is no reduce operator
  12. Number of splits: 237
  13. Submitting the tokens for job: job_1569295562481_2677748
  14. ...the delegation-token lines themselves (the HDFS_DELEGATION_TOKEN / HIVE_DELEGATION_TOKEN entries)
  15. The job's tracking URL on YARN: http://host:port/proxy/application_1569295562481_2677748/
  16. Starting the job: Job = job_jobName, Tracking URL = http://host:port/proxy/application_1569295562481_2677748/
  17. The kill command: /usr/lib/hadoop/bin/hadoop job  -kill job_jobName
  18. Hadoop job information for Stage-1: 237 mappers, 0 reducers
  19. Map progress percentages...... each line shows (map = X%, reduce = X%, cumulative CPU xx sec)
  20. MapReduce total cumulative CPU time: 39 minutes 58 seconds 680 msec
  21. Info: Ended Job = job_Name
  22. Start task [Stage-7:CONDITIONAL] in serial mode
  23. Stage-4 is selected by the condition resolver
  24. Stage-3 is filtered out by the condition resolver
  25. Stage-5 is filtered out by the condition resolver
  26. Start task [Stage-4:MOVE] in serial mode
  27. Move the data from the source (staging directory) to the result location
  28. Start task [Stage-0:MOVE] in serial mode
  29. Load the data from HDFS into the Hive table
  30. Time taken to load the dynamic partitions: 37085
  31. Loading...... (one Loading partition line per partition)
  32. Time taken for adding to the write entity: 12
  33. Start task [Stage-2:STATS] in serial mode
  34. Info: partition statistics, in the form (db.table{data_day=20190620}, stats: [numFiles=1, numRows=222592, totalSize=11316154, rawDataSize=11093562])
  35. The MapReduce jobs have finished
  36. Stage-Stage-1: Map: 237   Cumulative CPU: 2398.68 sec   HDFS Read: 21622480200 HDFS Write: 459476088 SUCCESS
  37. Total MapReduce CPU time spent: 39 minutes 58 seconds 680 msec
  38. Finished executing the command; time taken: 1008.946 seconds
  39. No rows affected (1009.374 seconds in total)