MapReduce Programming

maclaren001

Article link: https://blog.csdn.net/maclaren001/article/details/25507023

Ways to run a MapReduce job:
1. From Eclipse
2. From the command line

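# Compile WordCount.java into firstDir
mkdir firstDir
javac -classpath ~/hadoop/hadoop-0.20.2-core.jar -d firstDir WordCount.java
# Package the compiled classes (the trailing "." adds everything under firstDir/)
jar -cvf WordCount.jar -C firstDir/ .
# Create the HDFS input directory and upload the input files
sh hadoop dfs -mkdir input
sh hadoop dfs -put ~/input/file0* input
# Run the job
hadoop jar WordCount.jar WordCount input output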

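1. Hadoop 0.20.2 ships a new API alongside the old one (the new API first appeared in the 0.20 release line); the two APIs are not compatible with each other.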
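Old API (package org.apache.hadoop.mapred)
mapper

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    ...
    output.collect(outKey, outValue);  // outKey (Text) and outValue (IntWritable) are placeholder names for values built in the elided code
  }
}

reducer

public class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    ...
    output.collect(outKey, outValue);  // placeholder names, as in the mapper
  }
}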

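New API (package org.apache.hadoop.mapreduce)
mapper

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    ...
    context.write(outKey, outValue);  // outKey (Text) and outValue (IntWritable) are placeholder names for values built in the elided code
  }
}

reducer

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  // The new API passes the values as an Iterable, not an Iterator.
  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    ...
    context.write(outKey, outValue);  // placeholder names, as in the mapper
  }
}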

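2. Control flow and data flow
Control flow: the JobTracker schedules tasks onto TaskTrackers, and each TaskTracker reports progress back to the JobTracker. If a TaskTracker fails, the JobTracker reassigns its tasks to other TaskTrackers.

Data flow: the input is divided into InputSplits according to the input format (TextInputFormat here), and each split is fed to one map task. The map task reads the data at the location its InputSplit specifies, processes it as configured, and writes the result to the local disk. (Map output is not written to HDFS, both to save network traffic and because a failed map task can simply be rerun. The reduce tasks, however, must fetch the map output and then write their results to HDFS; that network cost cannot be avoided.)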

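Notes:
1. The number of reduce tasks can be set explicitly; the job produces one output file per reduce task.
2. If there are no reduce tasks, the output of the map tasks is the final output.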
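
The data flow and the two notes above can be sketched as a small in-memory simulation. This is plain Python, not Hadoop code: the function names (map_fn, partition, reduce_fn, run_job) are made up for illustration, and the partition function is a simplified stand-in for Hadoop's hash-based default partitioner.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit (word, 1) for each word in one input line."""
    return [(word, 1) for word in line.split()]


def partition(key, num_reducers):
    """Choose a reducer for a key (simplified, deterministic stand-in)."""
    return sum(ord(c) for c in key) % num_reducers


def reduce_fn(key, values):
    """Reduce: sum all counts collected for one key."""
    return key, sum(values)


def run_job(splits, num_reducers):
    """Map every split, shuffle by partition, then reduce.

    Returns one list per reducer, mirroring one output file per reduce task.
    With zero reducers the map output is returned as is, one list per split.
    """
    if num_reducers == 0:
        return [[pair for line in split for pair in map_fn(line)]
                for split in splits]
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for line in split:
            for key, value in map_fn(line):
                buckets[partition(key, num_reducers)][key].append(value)
    return [[reduce_fn(k, vs) for k, vs in sorted(b.items())] for b in buckets]


if __name__ == "__main__":
    splits = [["hello hadoop", "hello world"], ["hadoop streaming"]]
    for i, part in enumerate(run_job(splits, 2)):
        print("reducer %d output: %s" % (i, part))
```

Calling run_job with a different num_reducers changes only how many output lists come back, not the combined counts.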

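3. Optimizing MapReduce jobs
Optimization is mainly about compute performance and network I/O.
1) Schedule each map task on the machine that holds its InputSplit whenever possible, to reduce network I/O.
2) MapReduce handles a small number of large files well and a large number of small files poorly. With FileInputFormat, Hadoop by default turns each block into one InputSplit, so the block size determines how much data a map task receives. Preprocess the input before submitting a job (for example, by merging small files) so that each map task runs for about one minute.
3) The number of reduce tasks should be roughly 0.95 or 1.75 times the number of reduce slots. With the smaller factor, all reduce tasks run in a single wave and a failed reduce task finds a free machine quickly; with the larger factor, a machine that finishes one reduce task can start another right away, which balances load better.
4) Use a combine function in the map phase to merge map output locally and reduce network I/O. Usage: add job.setCombinerClass(combine.class) to the program.
5) Compression: a trade-off between time and space.
6) Use a custom comparator.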
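
To make item 4 concrete, here is a minimal sketch of why a combiner reduces network I/O. It is a plain-Python simulation, not the Hadoop API; map_words, combine, shuffled_records and reduce_all are hypothetical names chosen for this example.

```python
from collections import Counter


def map_words(line):
    """Map: emit (word, 1) for each word in a line."""
    return [(w, 1) for w in line.split()]


def combine(pairs):
    """Combiner: sum counts per key within one map task's output."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def shuffled_records(splits, use_combiner):
    """Return the records that all map tasks would send to the reducers."""
    sent = []
    for split in splits:
        pairs = [p for line in split for p in map_words(line)]
        sent.extend(combine(pairs) if use_combiner else pairs)
    return sent


def reduce_all(records):
    """Reduce: sum counts per key across all map tasks."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    splits = [["a a a b"], ["a b b"]]
    plain = shuffled_records(splits, use_combiner=False)
    combined = shuffled_records(splits, use_combiner=True)
    print(len(plain), len(combined))
    print(reduce_all(plain) == reduce_all(combined))
```

The combiner is safe here because summing is associative and commutative, so the reducers see fewer records but produce the same totals.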

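4. Hadoop Streaming
Hadoop Streaming provides an API that lets users write the map and reduce functions in any scripting language (or as any executable), because it uses Unix standard input and standard output as the interface between the program and Hadoop.

Examples
1) Built-in Linux commands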

sh hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -input input -output output -mapper /bin/cat -reducer /usr/bin/wc

2) A Linux command together with a Java class

sh hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -input myinputdir -output myoutputdir -mapper org.apache.hadoop.mapred.lib.IdentityMapper -reducer /bin/wc

3) A bash script

sh hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -input input -output output -mapper /bin/cat -reducer ~/Desktop/test/reducer.sh -file ~/Desktop/test/reducer.sh

4) Contents of reducer.sh

grep hadoop

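5) Python
Reduce.py (despite its name, it is used as the mapper in the command that follows):

#!/usr/bin/python
# Streaming mapper: for each input line, emit "LongValueSum:<first field>\t1".

import sys


def generateLongCountToken(id):
    # The "LongValueSum:" prefix asks the aggregate reducer to sum the values for this key.
    return "LongValueSum:" + id + "\t" + "1"


def main(argv):
    # Iterating over sys.stdin stops cleanly at end of file.
    for line in sys.stdin:
        line = line.rstrip("\n")
        fields = line.split("\t")
        sys.stdout.write(generateLongCountToken(fields[0]) + "\n")


if __name__ == "__main__":
    main(sys.argv)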

sh hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -input input -output output -mapper Reduce.py -reducer aggregate -file Reduce.py

aggregate is a package shipped with Hadoop. It provides a reduce function and a combine function that implement simple aggregations such as sum, maximum, and minimum.
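
As a rough illustration of what the aggregate reducer does with the records Reduce.py emits, the sketch below sums "LongValueSum:" records per key. It is not Hadoop's implementation: aggregate_reduce is a hypothetical name, and only the LongValueSum case is handled.

```python
def aggregate_reduce(lines):
    """Sum the values of 'LongValueSum:<key><TAB><value>' records per key."""
    prefix = "LongValueSum:"
    totals = {}
    for line in lines:
        record = line.rstrip("\n")
        if not record.startswith(prefix):
            continue  # this sketch ignores every other record type
        key, value = record[len(prefix):].split("\t", 1)
        totals[key] = totals.get(key, 0) + int(value)
    return sorted(totals.items())


if __name__ == "__main__":
    mapper_output = [
        "LongValueSum:hadoop\t1\n",
        "LongValueSum:hello\t1\n",
        "LongValueSum:hadoop\t1\n",
    ]
    for key, total in aggregate_reduce(mapper_output):
        print("%s\t%d" % (key, total))
```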

5. Sharing data in a MapReduce job
1) Read and write HDFS files
2) Set job configuration properties
3) Use DistributedCache

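6. Chaining MapReduce jobs
Some complex problems cannot be solved by a single MapReduce job and need several. In that case the jobs have to be chained together.
1) Linear MapReduce job flow

2) Complex MapReduce job flow
Use org.apache.hadoop.mapreduce.lib.jobcontrol

3) Preprocessing and postprocessing steps for a job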
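
The linear flow in item 1 amounts to feeding each job's output to the next job. A minimal sketch of that idea follows, as a plain-Python simulation rather than Hadoop code; word_count_job, sort_by_count_job and run_chain are names invented for this example.

```python
from collections import Counter


def word_count_job(lines):
    """Job 1: count occurrences of each word, one 'word<TAB>count' record per word."""
    counts = Counter(word for line in lines for word in line.split())
    return ["%s\t%d" % (w, c) for w, c in sorted(counts.items())]


def sort_by_count_job(lines):
    """Job 2: order the records from job 1 by descending count, then by word."""
    records = [line.split("\t") for line in lines]
    records.sort(key=lambda r: (-int(r[1]), r[0]))
    return ["\t".join(r) for r in records]


def run_chain(jobs, data):
    """Run the jobs in order; each job's output becomes the next job's input."""
    for job in jobs:
        data = job(data)
    return data


if __name__ == "__main__":
    result = run_chain([word_count_job, sort_by_count_job], ["b a b", "c b a"])
    for line in result:
        print(line)
```

In a real Hadoop program the same pattern means starting the second job only after the first has finished, with the first job's output directory as the second job's input directory.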