Table of Contents
MapReduce Overview
A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The MapReduce framework sorts the outputs of the maps, which are then fed into the reduce tasks. Typically, both the input and the output of a job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
Typically, the compute nodes and the storage nodes are the same; that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes. This configuration allows the framework to schedule tasks efficiently on the nodes where the data already resides, resulting in very high aggregate bandwidth across the cluster.
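The map → sort/shuffle → reduce data flow described above can be sketched as a toy word count in plain Python. This only illustrates the shape of the data flow, not Hadoop's API; all function names here are made up:

```python
from collections import defaultdict

def map_phase(records):
    # Each map task emits <key, value> pairs; here: (word, 1)
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # The framework groups values by key before handing them to reduce,
    # and delivers the groups in key order
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

def reduce_phase(groups):
    # Each reduce call sees <key, (list of values)>
    return {k: sum(vs) for k, vs in groups}

lines = ["hello world", "hello mapreduce"]
result = reduce_phase(shuffle(map_phase(lines)))
# result == {"hello": 2, "mapreduce": 1, "world": 1}
```

In real Hadoop, of course, the map tasks and reduce tasks run on different machines, and the shuffle is a distributed, disk-backed merge sort rather than an in-memory dictionary.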
MapReduce Execution Flow
Input Phase
- The JobClient specifies where the input files are stored
- The JobClient specifies the split logic through the InputFormat interface; by default files are split along HDFS block boundaries, so there are as many map tasks as there are data blocks
- Hadoop then turns each split into a stream of <key, value> records
- The JobTracker assigns each data block to a mapper, while a RecordReader reads out the key-value pairs
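For plain text input, the <key, value> records a RecordReader produces are (byte offset of the line start, line contents). A rough, hypothetical re-creation of what a line record reader yields (not Hadoop's actual code):

```python
def text_records(data: bytes):
    # Mimics the behavior of a line-oriented RecordReader:
    # key = byte offset where the line starts, value = the line's text
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\n").decode()
        offset += len(line)

recs = list(text_records(b"first\nsecond\n"))
# recs == [(0, "first"), (6, "second")]
```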
Mapper Phase
- The JobTracker dispatches map tasks to TaskTrackers for execution
- As a mapper produces output, the records are collected in an in-memory buffer (100 MB by default), where they are partitioned and sorted by key; when the buffer fills up, its contents are spilled to disk
- The output is partitioned by key, so that all records with the same key land in the same partition
- Within each partition, records are sorted by key
- Combiner phase (optional): on each map host, key-value pairs with the same key are pre-aggregated, reducing the load on the reducers
- The map task writes its output to the (local) file system
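The partition, sort, and combine steps above can be imitated in a few lines. This is a sketch under stated assumptions, not the real implementation: Python's built-in hash() stands in for Java's key.hashCode(), and all function names are invented:

```python
from collections import defaultdict

def partition(key, num_reduces):
    # Stand-in for the default HashPartitioner: hash(key) mod numReduceTasks
    return hash(key) % num_reduces

def map_side(pairs, num_reduces):
    # Bucket the map output by partition, then sort each partition by key
    buckets = defaultdict(list)
    for k, v in pairs:
        buckets[partition(k, num_reduces)].append((k, v))
    return {p: sorted(kvs) for p, kvs in buckets.items()}

def combine(sorted_kvs):
    # Optional combiner: pre-aggregate runs of identical keys on the map side
    out = []
    for k, v in sorted_kvs:
        if out and out[-1][0] == k:
            out[-1] = (k, out[-1][1] + v)
        else:
            out.append((k, v))
    return out

parts = map_side([("a", 1), ("b", 1), ("a", 1)], num_reduces=2)
# every occurrence of a given key lands in the same partition
```

Because the partition function depends only on the key, every mapper sends records with the same key to the same reducer, which is what makes the grouping in the reduce phase correct.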
Reducer Phase
A Reducer has three main phases: shuffle, sort, and reduce.
- Shuffle: the reducers fetch the relevant partition of every mapper's output over HTTP
- Grouping: the reducer groups records with the same key coming from different map outputs, since multiple mappers may emit the same key
- Sort: during the reduce stage, the framework merge-sorts the map outputs by key; this step is called sort
Note: the shuffle and sort phases happen simultaneously; map outputs are merged while they are being fetched.
- Secondary Sort: if Job.setSortComparatorClass(Class) is set, the intermediate data is sorted with a custom comparator. Since Job.setGroupingComparatorClass(Class) controls how the intermediate keys are grouped, the two can be used together, with part of the value folded into the key, to simulate a secondary sort on the values
- Reduce: the reduce(WritableComparable, Iterable<Writable>, Context) method is called once for each <key, (list of values)> group, and the final output is written to the file system
The number of reducers can be set with Job.setNumReduceTasks(int). // it must be no smaller than the number of partitions produced by the partitioner, otherwise the job fails
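The secondary-sort idea above can be illustrated without Hadoop: sort by a composite key, but group the reduce input by the natural key only, so that each group's values arrive already ordered by the secondary field. A hypothetical sketch (the composite-key sort plays the role of setSortComparatorClass, the natural-key grouping the role of setGroupingComparatorClass):

```python
from itertools import groupby

# Records are ((natural_key, secondary_field), value)
records = [(("u1", 3), "c"), (("u2", 1), "x"), (("u1", 1), "a"), (("u1", 2), "b")]
records.sort(key=lambda kv: kv[0])          # sort by the full composite key
grouped = [(nk, [v for _, v in grp])        # group by the natural key only
           for nk, grp in groupby(records, key=lambda kv: kv[0][0])]
# grouped == [("u1", ["a", "b", "c"]), ("u2", ["x"])]
```

Each reduce "call" (one entry of grouped) now sees its values pre-sorted by the secondary field, which is exactly what a secondary sort buys you.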
Example
Under the hood, a Hive query is executed as one or more MapReduce jobs.
insert overwrite table
0: jdbc:hive2://hiveserver2.bigdata.chinatele> insert OVERWRITE table jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d PARTITION (data_day)
. . . . . . . . . . . . . . . . . . . . . . .> select * from sc_share_db.app_mbl_user_trmnl_trail_info_d a where a.data_day BETWEEN 20190619 AND 20191007;
INFO : Compiling command(queryId=hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45): insert OVERWRITE table jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d PARTITION (data_day)
select * from sc_share_db.app_mbl_user_trmnl_trail_info_d a where a.data_day BETWEEN 20190619 AND 20191007
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a.mdn, type:string, comment:null), FieldSchema(name:a.r_trmnl_brand, type:string, comment:null), FieldSchema(name:a.r_trmnl_model, type:string, comment:null), FieldSchema(name:a.r_use_day, type:string, comment:null), FieldSchema(name:a.d_trmnl_brand, type:string, comment:null), FieldSchema(name:a.d_trmnl_model, type:string, comment:null), FieldSchema(name:a.d_use_day, type:string, comment:null), FieldSchema(name:a.data_day, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45); Time taken: 0.412 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45): insert OVERWRITE table jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d PARTITION (data_day)
select * from sc_share_db.app_mbl_user_trmnl_trail_info_d a where a.data_day BETWEEN 20190619 AND 20191007
INFO : Query ID = hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45
INFO : Total jobs = 3
INFO : Launching Job 1 out of 3
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:237
INFO : Submitting tokens for job: job_1569295562481_2677748
INFO : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ns4, Ident: (token for jt_jtsjzxsjyyc_sc_fwfz: HDFS_DELEGATION_TOKEN owner=jt_jtsjzxsjyyc_sc_fwfz, renewer=yarn, realUser=hive/hiveserver2.bigdata.chinatelecom.cn@HADOOP.CHINATELECOM.CN, issueDate=1570678407251, maxDate=1571283207251, sequenceNumber=99889585, masterKeyId=889)
INFO : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ns3, Ident: (token for jt_jtsjzxsjyyc_sc_fwfz: HDFS_DELEGATION_TOKEN owner=jt_jtsjzxsjyyc_sc_fwfz, renewer=yarn, realUser=hive/hiveserver2.bigdata.chinatelecom.cn@HADOOP.CHINATELECOM.CN, issueDate=1570678407264, maxDate=1571283207264, sequenceNumber=100362646, masterKeyId=873)
INFO : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ns, Ident: (token for jt_jtsjzxsjyyc_sc_fwfz: HDFS_DELEGATION_TOKEN owner=jt_jtsjzxsjyyc_sc_fwfz, renewer=yarn, realUser=hive/hiveserver2.bigdata.chinatelecom.cn@HADOOP.CHINATELECOM.CN, issueDate=1570678406757, maxDate=1571283206757, sequenceNumber=381027444, masterKeyId=1165)
INFO : Kind: HIVE_DELEGATION_TOKEN, Service: HiveServer2ImpersonationToken, Ident: 00 16 6a 74 5f 6a 74 73 6a 7a 78 73 6a 79 79 63 5f 73 63 5f 66 77 66 7a 16 6a 74 5f 6a 74 73 6a 7a 78 73 6a 79 79 63 5f 73 63 5f 66 77 66 7a 3f 68 69 76 65 2f 68 69 76 65 73 65 72 76 65 72 32 2e 62 69 67 64 61 74 61 2e 63 68 69 6e 61 74 65 6c 65 63 6f 6d 2e 63 6e 40 48 41 44 4f 4f 50 2e 43 48 49 4e 41 54 45 4c 45 43 4f 4d 2e 43 4e 8a 01 6d b3 a7 a2 5e 8a 01 6d d7 b4 26 5e 8e 77 aa 8e 19 30
INFO : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ns2, Ident: (token for jt_jtsjzxsjyyc_sc_fwfz: HDFS_DELEGATION_TOKEN owner=jt_jtsjzxsjyyc_sc_fwfz, renewer=yarn, realUser=hive/hiveserver2.bigdata.chinatelecom.cn@HADOOP.CHINATELECOM.CN, issueDate=1570678407250, maxDate=1571283207250, sequenceNumber=110691977, masterKeyId=871)
INFO : The url to track the job: http://NM-304-RH5885V3-BIGDATA-008:8088/proxy/application_1569295562481_2677748/
INFO : Starting Job = job_1569295562481_2677748, Tracking URL = http://NM-304-RH5885V3-BIGDATA-008:8088/proxy/application_1569295562481_2677748/
INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1569295562481_2677748
INFO : Hadoop job information for Stage-1: number of mappers: 237; number of reducers: 0
INFO : 2019-10-10 11:35:34,427 Stage-1 map = 0%, reduce = 0%
INFO : 2019-10-10 11:36:18,032 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 10.65 sec
INFO : 2019-10-10 11:36:19,080 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 96.18 sec
INFO : 2019-10-10 11:36:20,131 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 224.8 sec
INFO : 2019-10-10 11:36:21,181 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 342.52 sec
INFO : 2019-10-10 11:36:22,243 Stage-1 map = 29%, reduce = 0%, Cumulative CPU 444.37 sec
INFO : 2019-10-10 11:36:23,304 Stage-1 map = 47%, reduce = 0%, Cumulative CPU 1014.68 sec
INFO : 2019-10-10 11:36:24,582 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 1638.8 sec
================================ (part of the output omitted)
INFO : 2019-10-10 11:36:37,280 Stage-1 map = 84%, reduce = 0%, Cumulative CPU 2181.11 sec
INFO : 2019-10-10 11:36:48,674 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 2209.67 sec
INFO : 2019-10-10 11:37:01,292 Stage-1 map = 90%, reduce = 0%, Cumulative CPU 2299.86 sec
INFO : 2019-10-10 11:37:07,494 Stage-1 map = 92%, reduce = 0%, Cumulative CPU 2322.2 sec
INFO : 2019-10-10 11:37:12,849 Stage-1 map = 93%, reduce = 0%, Cumulative CPU 2335.47 sec
INFO : 2019-10-10 11:37:13,886 Stage-1 map = 97%, reduce = 0%, Cumulative CPU 2363.0 sec
INFO : 2019-10-10 11:37:14,922 Stage-1 map = 98%, reduce = 0%, Cumulative CPU 2372.74 sec
INFO : 2019-10-10 11:39:16,852 Stage-1 map = 99%, reduce = 0%, Cumulative CPU 2386.92 sec
INFO : 2019-10-10 11:46:52,457 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2398.68 sec
INFO : MapReduce Total cumulative CPU time: 39 minutes 58 seconds 680 msec
INFO : Ended Job = job_1569295562481_2677748
INFO : Starting task [Stage-7:CONDITIONAL] in serial mode
INFO : Stage-4 is selected by condition resolver.
INFO : Stage-3 is filtered out by condition resolver.
INFO : Stage-5 is filtered out by condition resolver.
INFO : Starting task [Stage-4:MOVE] in serial mode
INFO : Moving data to: viewfs://ctccfs/user/hive_tmp/.hive-staging_hive_2019-10-10_11-32-41_413_3936369249909689263-4162/-ext-10000 from viewfs://ctccfs/user/hive_tmp/.hive-staging_hive_2019-10-10_11-32-41_413_3936369249909689263-4162/-ext-10002
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Loading data to table jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d partition (data_day=null) from viewfs://ctccfs/user/hive_tmp/.hive-staging_hive_2019-10-10_11-32-41_413_3936369249909689263-4162/-ext-10000
INFO : Time taken for load dynamic partitions : 37085
INFO : Loading partition {data_day=20190915}
INFO : Loading partition {data_day=20190624}
INFO : Loading partition {data_day=20190906}
INFO : Loading partition {data_day=20190902}
INFO : Loading partition {data_day=20190909}
================================ (part of the output omitted)
INFO : Loading partition {data_day=20190901}
INFO : Loading partition {data_day=20190916}
INFO : Loading partition {data_day=20190908}
INFO : Loading partition {data_day=20190723}
INFO : Time taken for adding to write entity : 12
INFO : Starting task [Stage-2:STATS] in serial mode
INFO : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191001} stats: [numFiles=1, numRows=106817, totalSize=5047510, rawDataSize=4940693]
INFO : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191002} stats: [numFiles=1, numRows=142186, totalSize=7349564, rawDataSize=7207378]
INFO : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191003} stats: [numFiles=1, numRows=146760, totalSize=7585261, rawDataSize=7438501]
================================ (part of the output omitted)
INFO : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191004} stats: [numFiles=1, numRows=115010, totalSize=5880787, rawDataSize=5765777]
INFO : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191005} stats: [numFiles=1, numRows=128308, totalSize=6711669, rawDataSize=6583361]
INFO : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191006} stats: [numFiles=1, numRows=104644, totalSize=5418150, rawDataSize=5313506]
INFO : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191007} stats: [numFiles=1, numRows=89627, totalSize=4577004, rawDataSize=4487377]
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 237 Cumulative CPU: 2398.68 sec HDFS Read: 21622480200 HDFS Write: 459476088 SUCCESS
INFO : Total MapReduce CPU Time Spent: 39 minutes 58 seconds 680 msec
INFO : Completed executing command(queryId=hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45); Time taken: 1008.946 seconds
INFO : OK
No rows affected (1009.374 seconds)
Reading the log
- Compilation of the insert command starts; a queryId is assigned
- The HiveQL is analyzed; semantic analysis completes
- The Hive schema is returned: a list of FieldSchema entries (field name, field type, field comment) plus properties
- Compilation completes, echoing the queryId (same as in the first step) and the compile time in seconds
- Info: concurrency mode is disabled, so no lock manager is created
- Execution of the insert command starts
- Info: the queryId
- 3 jobs in total
- Job 1 is launched
- Task [Stage-1:MAPRED] is started in serial mode
- The number of reduce tasks is set to 0 because there is no reduce operator
- Number of splits: 237
- Delegation tokens are submitted for job job_1569295562481_2677748
- ... (token details: HDFS_DELEGATION_TOKEN, HIVE_DELEGATION_TOKEN)
- The job's tracking URL on YARN: http://host:port/proxy/application_1569295562481_2677748/
- Starting Job = job_jobName, Tracking URL = http://host:port/proxy/application_1569295562481_2677748/
- The command to kill the job: /usr/lib/hadoop/bin/hadoop job -kill job_jobName
- Hadoop job information for Stage-1: 237 mappers, 0 reducers
- Map progress lines ...... each one shows (map = X%, reduce = X%, cumulative CPU xx sec)
- MapReduce total cumulative CPU time: 39 minutes 58 seconds 680 msec
- Info: Ended Job = job_Name
- Task [Stage-7:CONDITIONAL] is started in serial mode
- Stage-4 is selected by the condition resolver
- Stage-3 is filtered out by the condition resolver
- Stage-5 is filtered out by the condition resolver
- Task [Stage-4:MOVE] is started in serial mode
- The data is moved from the staging directory to the target location
- Task [Stage-0:MOVE] is started in serial mode
- The data is loaded from HDFS into the Hive table
- Time taken to load the dynamic partitions: 37085
- Loading partition {data_day=...}, one line per partition
- Time taken for adding to write entity: 12
- Task [Stage-2:STATS] is started in serial mode
- Info: per-partition statistics, e.g. (db.table{data_day=20190620}, stats: [numFiles=1, numRows=222592, totalSize=11316154, rawDataSize=11093562])
- The MapReduce jobs have finished
- Stage-Stage-1: Map: 237 Cumulative CPU: 2398.68 sec HDFS Read: 21622480200 HDFS Write: 459476088 SUCCESS
- Total MapReduce CPU time spent: 39 minutes 58 seconds 680 msec
- Command execution completed; time taken: 1008.946 seconds
- No rows affected (1009.374 seconds in total)