MapReduce Explained (Part 1)
MapReduce:
hadoop1.x: MR1 (computation + resource and job scheduling)
hadoop2.x: MR2 (computation) + Yarn (resource scheduling)
MR1 processes:
JobTracker
TaskTracker: map task, reduce task
MR2: write the code, package it as a jar, and submit it to Yarn to run
1. No separate deployment is needed
2. Architecture
-->how an MR job is submitted to Yarn-->Yarn architecture, Yarn workflow
Yarn processes:
ResourceManager:
ApplicationsManager: application management
Scheduler: schedules cluster resources
NodeManager:
Container (*****): Yarn's abstraction of resources; it encapsulates the multi-dimensional resources (memory + CPU) of a given NM
map tasks and reduce tasks run inside containers (see the sketch after this list)
MR ApplicationMaster: exactly one per MR job, and it runs in a container on an NM
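To make the container resource idea concrete, here is a minimal sketch using the YarnClient API that lists each NodeManager's total memory and vcores, i.e. the pool the RM carves containers out of. It assumes a reachable RM configured via yarn-site.xml on the classpath; the class name ListNodeResources is made up for illustration:

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodeResources {
    public static void main(String[] args) throws Exception {
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());
        client.start();
        // one report per NodeManager; the capability is the total
        // memory + vcores the RM can hand out as containers on that node
        for (NodeReport node : client.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId()
                    + " memoryMB=" + node.getCapability().getMemory()
                    + " vcores=" + node.getCapability().getVirtualCores());
        }
        client.stop();
    }
}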
3. Word frequency count: wordcount
[hadoop@xkhadoop hadoop]$ vi 1.log
ddd 23 343 55533
3423 343 34
454 35
[hadoop@xkhadoop hadoop]$ hdfs dfs -mkdir -p /wordcount/input
[hadoop@xkhadoop hadoop]$ hdfs dfs -put 1.log /wordcount/input
[hadoop@xkhadoop hadoop]$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar \
> wordcount /wordcount/input /wordcount/output1
[hadoop@xkhadoop hadoop]$ hdfs dfs -ls /wordcount/output1
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2018-12-04 23:10 /wordcount/output1/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 48 2018-12-04 23:10 /wordcount/output1/part-r-00000
[hadoop@xkhadoop hadoop]$ hdfs dfs -cat /wordcount/output1/part-r-00000
23 1
34 1
3423 1
343 2
35 1
454 1
55533 1
ddd 1
Principle (my own understanding): split each line on whitespace into an array of words; the shuffle brings all identical words together; then count how many entries land in each group.
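That split --> shuffle --> count flow can be mimicked locally with plain Java collections. This is a toy sketch with no Hadoop involved; LocalWordCount is a made-up name, and the input string is the 1.log content from above:

import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class LocalWordCount {
    public static void main(String[] args) {
        String text = "ddd 23 343 55533\n3423 343 34\n454 35";
        // "map": split on whitespace, one record per word
        // "shuffle": group identical words together (groupingBy)
        // "reduce": count the members of each group
        Map<String, Long> counts = Arrays.stream(text.split("\\s+"))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        // TreeMap sorts keys as strings, which is why 3423 comes
        // before 343 -- the same order as part-r-00000 above
        new TreeMap<>(counts).forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}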
4. Shuffle: a key tuning point (for Hive and Spark as well)
Map: mapping
Reduce: reduction
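For reference, the wordcount jar used in step 3 boils down to a mapper plus a reducer along the lines of the bundled Hadoop example. A sketch follows; the setCombinerClass line is one common shuffle tuning, pre-aggregating map output before it crosses the network:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // map: split each line on whitespace, emit (word, 1)
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // reduce: after the shuffle groups identical keys, sum up the 1s
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // combiner: pre-aggregate on the map side to shrink shuffle traffic
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar (say wc.jar, a made-up name), it runs the same way as step 3: hadoop jar wc.jar WordCount /wordcount/input /wordcount/output2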
A repost of a blog with a more detailed introduction:
http://blog.itpub.net/30089851/viewspace-2095837/