Just dumping a quick filler post for now; I'll properly tidy up and organize this blog over the summer vacation!!
The MapReduce Programming Model
The MapReduce Processing Flow
The MapReduce Program Framework
When running a Hadoop program, note that the program's input must be placed in the HDFS file system; it cannot be a local file.
In this example, the base directory used in HDFS is
/hbase
We take the simplest possible WordCount program as the example.
The program's input is placed in the /hbase/input directory,
and its output is written to the /hbase/output directory.
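For example, the input can be prepared with the standard HDFS shell, along the lines of `hdfs dfs -mkdir -p /hbase/input` followed by `hdfs dfs -put <some local file> /hbase/input`. Note that /hbase/output must not already exist when the job is submitted, otherwise Hadoop will refuse to run it.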
A Hadoop MapReduce job operates on key/value pairs. So that these pairs can move around the cluster, Hadoop provides a set of basic data types that implement the WritableComparable interface, allowing the data to be serialized for network transfer, stored in files, and compared for sorting.
- Values: are only passed along, so they only need to implement the Writable (or WritableComparable) interface
- Keys: must be compared and sorted during the shuffle/Reduce phase, so they must implement the WritableComparable interface
Class | Description |
---|---|
BooleanWritable | Standard boolean value |
ByteWritable | Single-byte value |
DoubleWritable | Double-precision floating-point number |
FloatWritable | Single-precision floating-point number |
IntWritable | Integer |
LongWritable | Long integer |
Text | Text stored in UTF-8 format |
NullWritable | Used when the key or value in a <key,value> pair is empty |
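The built-in types above cover most cases. As an illustration of what the interface contract looks like, here is a minimal sketch (not from the original post; the class name WordPair and its fields are purely hypothetical) of a custom key type implementing WritableComparable:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key type: serializable (write/readFields) and sortable (compareTo).
public class WordPair implements WritableComparable<WordPair> {
    private String first = "";
    private String second = "";

    public void set(String first, String second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields for network transfer / file storage.
        out.writeUTF(first);
        out.writeUTF(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize in exactly the order the fields were written.
        first = in.readUTF();
        second = in.readUTF();
    }

    @Override
    public int compareTo(WordPair other) {
        // Keys are sorted by this comparison during the shuffle.
        int cmp = first.compareTo(other.first);
        return cmp != 0 ? cmp : second.compareTo(other.second);
    }
}
```

In practice such a type should also override hashCode and equals so that the default partitioner distributes keys consistently.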
The Mapper Stage
```java
package com.lisong.hadoop;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into tokens separated by whitespace.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // Emit a <word, 1> pair for every token in the line.
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
```
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
For this class, the generic type parameters are:
- Object: the type of the input key
- Text: the type of the input value
- the second Text: the type of the output key
- IntWritable: the type of the output value
The input key/value pairs:
- value is one line of the text file
- key is the byte offset of the first character of that line from the beginning of the file
The StringTokenizer class then splits each line on whitespace into individual words.
- Mapper output: <word, 1> key/value pairs
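As a small illustrative example (not from the original post): if one input line is `hello world hello`, the Mapper emits `<hello,1>`, `<world,1>` and `<hello,1>` for that line.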
The Reducer Stage
```java
package com.lisong.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum up all the counts emitted for this word.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        // Emit <word, total count>.
        context.write(key, result);
    }
}
```
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
The generic parameters mean the same as for the Mapper: in order, the input key type, input value type, output key type, and output value type.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
The input parameters of the reduce method are:
- key: a single word
- values: the list of counts for that word produced by the individual Mappers, passed in as an Iterable
So the work of reduce is simple: iterate over the list and add up the total number of occurrences of the word.
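Continuing the small example above: after the shuffle, reduce receives something like `<hello, [1, 1]>` and `<world, [1]>`, and writes out `<hello, 2>` and `<world, 1>`.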
Running the Job
```java
package com.lisong.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        // Class that implements the Mapper stage
        job.setMapperClass(TokenizerMapper.class);
        // Classes that implement the Combine and Reduce stages
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        // The output key type is Text
        job.setOutputKeyClass(Text.class);
        // The output value type is IntWritable
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        // Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
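Two notes on the driver. First, IntSumReducer can double as the Combiner because summing counts is associative and commutative, so partially aggregating the <word,1> pairs on the map side does not change the final result. Second, with the HDFS layout described at the top of this post, the job would be launched with something like `hadoop jar wordcount.jar com.lisong.hadoop.WordCount /hbase/input /hbase/output`, where the jar name is only an assumption; use whatever name your build produces.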
The Overall WordCount Flow
WordCount
WordCountMapper.java
```java
package com.wcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, NullWritable, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, NullWritable, LongWritable>.Context context)
            throws IOException, InterruptedException {
        // Get the line as a string
        String valueString = value.toString();
        // Split the line on spaces
        String[] wArr = valueString.split(" ");
        // Emit <NullWritable, number of words in this line>
        context.write(NullWritable.get(), new LongWritable(wArr.length));
    }
}
```
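Unlike TokenizerMapper above, this Mapper emits exactly one pair per input line: the key is the NullWritable singleton and the value is the number of space-separated words in that line, so every count is grouped under the same (null) key and arrives at a single reduce call.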
WordCountReducer.java
```java
package com.wcount;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
    @Override
    protected void reduce(NullWritable key, Iterable<LongWritable> v2s,
            Reducer<NullWritable, LongWritable, NullWritable, LongWritable>.Context context)
            throws IOException, InterruptedException {
        // Sum the per-line word counts emitted by the Mapper
        Iterator<LongWritable> it = v2s.iterator();
        long sum = 0;
        while (it.hasNext()) {
            sum += it.next().get();
        }
        // Emit the total number of words in the input
        context.write(NullWritable.get(), new LongWritable(sum));
    }
}
```
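The original post stops here and does not show a driver for this second variant. Purely as a hedged sketch of how it could be wired up (the class name JobRunner is hypothetical; it simply mirrors the first driver above), a job configured for this Mapper/Reducer pair would declare NullWritable/LongWritable as the output key/value types:

```java
package com.wcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver for the second WordCount variant (not part of the original post).
public class JobRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "total word count");
        job.setJarByClass(JobRunner.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // The output key/value types must match what the Reducer emits.
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(LongWritable.class);
        // Input and output paths are passed on the command line, as in the first example.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With this wiring, the job writes a single line to the output directory containing the total number of space-separated words across all input files, rather than a per-word frequency table like the first version.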