Below are some notes from my own study; I hope they are helpful.
I think the first step in learning MapReduce is to understand what the map and reduce functions each do:
MapReduce is a computation framework: it takes an input, processes that input through a user-defined computation model, and produces an output, which is the result we want.
What we have to learn are the rules by which this computation model runs. When a MapReduce job executes, it is divided into two phases, a map phase and a reduce phase, and each phase uses key/value pairs as its input and output. The programmer's job is to define the functions for these two phases: the map function and the reduce function.
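The map → shuffle → reduce key/value flow described above can be sketched in plain Java, with no Hadoop involved. This is only an illustration of the data flow; the class and method names here are my own and are not part of any Hadoop API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MiniMapReduce {
    public static Map<String, Long> wordCount(List<String> lines) {
        // map phase: each input record (a line) becomes a list of (word, 1) pairs
        List<Map.Entry<String, Long>> intermediate = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                intermediate.add(Map.entry(word, 1L));
            }
        }
        // shuffle: group the intermediate pairs by key, as the framework would
        Map<String, List<Long>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Long> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // reduce phase: each (key, list of values) becomes one (key, sum) output pair
        Map<String, Long> output = new LinkedHashMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            long sum = 0;
            for (long v : e.getValue()) sum += v;
            output.put(e.getKey(), sum);
        }
        return output;
    }

    public static void main(String[] args) {
        // two input lines -> word counts
        System.out.println(wordCount(List.of("hello world", "hello hadoop")));
    }
}
```

The real framework does the grouping step (the shuffle) for us across machines; the programmer only supplies the logic of the map and reduce steps.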
The main task of map is to turn the input key/value pairs into the specified intermediate results (which are also key/value pairs). The Mapper class provides four methods: setup(), map(), cleanup(), and run(). setup() does preparatory work before any map() call; map() is the main data-processing method, invoked once per input record; cleanup() runs after all map() calls have finished and does teardown work, much like a finally clause; run() drives the other three in that order.
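The invocation order of these methods can be sketched in plain Java. This is only an illustration of the lifecycle, not actual Hadoop source code:

```java
import java.util.ArrayList;
import java.util.List;

public class MapperLifecycle {
    final List<String> trace = new ArrayList<>();

    void setup()            { trace.add("setup"); }        // once, before any map() call
    void map(String record) { trace.add("map:" + record); } // once per input record
    void cleanup()          { trace.add("cleanup"); }       // once, after the last map() call

    // roughly what the framework's run() does: setup, then map() per record, then cleanup
    void run(List<String> records) {
        setup();
        try {
            for (String r : records) map(r);
        } finally {
            cleanup(); // like a finally clause, this runs even if map() throws
        }
    }

    public static void main(String[] args) {
        MapperLifecycle m = new MapperLifecycle();
        m.run(List.of("a", "b"));
        System.out.println(m.trace);
    }
}
```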
Next, what the parameters of map and reduce represent:
An overridden map function typically has three parameters: key, the byte offset of the current line within the input; value, the line of data to process; and context, through which the intermediate key/value pairs are written out.
The reduce function aggregates the data coming out of the map phase. Its overridden form also has three parameters, but their types differ from map's: the key is one distinct intermediate key, the second parameter is an Iterable over all the values grouped under that key, and context writes the final output.
The driver (main) class is as follows:
package cn.itcast.hadoop.mr;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
    public static void main(String[] args) throws Exception {
        // build the job object
        Job job = Job.getInstance(new Configuration());
        // the class containing the main method (used to locate the jar to ship)
        job.setJarByClass(WordCount.class);
        // configure the mapper
        job.setMapperClass(WCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path("/words.txt"));
        // configure the reducer
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/root/wcout"));
        // submit the job and wait for completion
        job.waitForCompletion(true);
    }
}
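One common refinement for this driver (my own addition, not in the original): because summing counts is associative, the same reducer class can also be registered as a combiner, so partial sums are computed on the map side before data crosses the network. This is a configuration fragment to place alongside the other job.set* calls above:

```java
// run WCReducer locally on each mapper's output to pre-aggregate (word, 1) pairs
job.setCombinerClass(WCReducer.class);
```

A combiner is only safe when, as here, the reduce logic gives the same result whether it is applied once or in stages.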
The Mapper class:
package cn.itcast.hadoop.mr;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // receive one line of input
        String line = value.toString();
        // split the line into words
        String[] words = line.split(" ");
        // emit (word, 1) once per occurrence
        for (String w : words) {
            context.write(new Text(w), new LongWritable(1));
        }
    }
}
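One thing to be aware of in the mapper above (my own note, not from the original): String.split(" ") produces empty strings when the input contains consecutive spaces, and each empty string would then be counted as a "word". A regex split such as split("\\s+") behaves better for free-form text:

```java
public class SplitDemo {
    public static void main(String[] args) {
        String line = "hello  world"; // two spaces between the words
        // splitting on a single space keeps the empty token between the two spaces
        System.out.println(line.split(" ").length);    // "hello", "", "world"
        // splitting on a whitespace run collapses consecutive spaces
        System.out.println(line.split("\\s+").length); // "hello", "world"
    }
}
```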
The Reducer class:
package cn.itcast.hadoop.mr;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // define a counter for this key
        long counter = 0;
        // loop over all values grouped under this key
        for (LongWritable i : values) {
            // i is a LongWritable, so call get() to obtain the primitive long
            counter += i.get();
        }
        // emit (word, total count)
        context.write(key, new LongWritable(counter));
    }
}