MapReduce Example
Mapper stage:
- A user-defined Mapper must extend the Mapper parent class;
- The Mapper's input data comes as KV pairs (the K and V types are customizable); K is the byte offset of the line within the file;
- The Mapper's business logic goes in the overridden map() method;
- The Mapper's output data is also in KV-pair form (the K and V types are customizable);
- The map() method (run by the MapTask process) is called once for every <K,V> pair;
Reducer stage:
- A user-defined Reducer must extend the Reducer parent class;
- The Reducer's input types match the Mapper's output types, and are also KV pairs;
- The Reducer's business logic goes in the overridden reduce() method;
- The ReduceTask process calls reduce() once for each group of <K,V> pairs sharing the same key;
Driver stage:
The Driver acts as the client of the YARN cluster; what it submits is a Job object wrapping the runtime parameters of the MapReduce program.
Testing WordCount Locally
Notes:
- Mapper and Reducer must be imported from the class-based packages (org.apache.hadoop.mapreduce), which apply to Hadoop 2.x and 3.x; the interface-based packages (org.apache.hadoop.mapred) apply to Hadoop 1.x;
- The overridden map() method is called once for every key-value pair;
- The overridden reduce() method is called only once per group of identical keys;
- Because these methods are called many times, outK and outV are declared as fields, avoiding the waste of repeatedly creating objects;
- Hadoop's Text corresponds to Java's String, and xxxWritable corresponds to Java's xxx, so type conversions are needed when moving between them; you can think of xxxWritable as a wrapper class (see the conversion sketch below).
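The following minimal sketch (not part of the original notes; the class name WritableConversionDemo is illustrative) shows the typical conversions between the Writable wrappers and plain Java types:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableConversionDemo {
    public static void main(String[] args) {
        // String -> Text and back
        Text text = new Text("atguigu");     // wrap a Java String
        String str = text.toString();        // unwrap to a Java String
        // int -> IntWritable and back
        IntWritable iw = new IntWritable(2); // wrap a Java int
        int i = iw.get();                    // unwrap with get()
        // Writables are mutable, which is why outK/outV can be reused as fields
        text.set("hadoop");
        iw.set(5);
        System.out.println(str + " " + i + " -> " + text + " " + iw);
    }
}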
WordCountMapper:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/*
Mapper is imported from the class-based package, which applies to Hadoop 2.x and 3.x; the interface package applies to 1.x.
KEYIN:    type of the map-stage input key: LongWritable (byte offset of the line)
VALUEIN:  type of the map-stage input value: Text
KEYOUT:   type of the map-stage output key: Text
VALUEOUT: type of the map-stage output value: IntWritable
Context:  used for communication between map, reduce, and the framework
*/
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text outK = new Text();
    private IntWritable outV = new IntWritable(1);

    @Override // called once for every key-value pair
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get one line, e.g.
        // atguigu atguigu
        String line = value.toString();
        // 2. Split it, yielding
        // atguigu
        // atguigu
        String[] words = line.split(" ");
        // 3. Write out each word
        for (String word : words) {
            outK.set(word);
            context.write(outK, outV);
        }
    }
}
WordCountReducer:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/*
Reducer is imported from the class-based package, which applies to Hadoop 2.x and 3.x; the interface package applies to 1.x.
The Mapper's output is the Reducer's input, so:
KEYIN:    type of the reduce-stage input key: Text
VALUEIN:  type of the reduce-stage input value: IntWritable
KEYOUT:   type of the reduce-stage output key: Text
VALUEOUT: type of the reduce-stage output value: IntWritable
*/
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable outV = new IntWritable();

    @Override // called once per group of identical keys
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // e.g. key = atguigu, values = (1, 1)
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        outV.set(count);
        context.write(key, outV);
    }
}
WordCountDriver:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class WordCountDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1. Get the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2. Set the jar path
        job.setJarByClass(WordCountDriver.class);
        // 3. Wire up the mapper and reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // 4. Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("E:\\temp1\\read"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\temp1\\output1"));
        // 7. Submit the job
        boolean result = job.waitForCompletion(true);
        // 8. Exit
        System.exit(result ? 0 : 1);
    }
}
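As a quick sanity check (file names here are illustrative; the input line comes from the Mapper comments above), an input file E:\temp1\read\hello.txt containing:
atguigu atguigu
produces an output file E:\temp1\output1\part-r-00000 in which each key and value are separated by a tab:
atguigu	2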
In real projects, the usual workflow is to test against a locally installed Hadoop environment first, then package the program and upload it to the server to run. This requires the following change to the WordCountDriver code, so that the input and output paths are passed in dynamically:
// 6. Set the input and output paths
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
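A common variant, sketched below (not part of the original code; the class name WordCountToolDriver is hypothetical), is to implement Hadoop's Tool interface and launch through ToolRunner, so that generic Hadoop options (-D key=value, -files, and so on) are parsed away before args[0] and args[1] are read:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountToolDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Same job setup as WordCountDriver, but using the Configuration
        // that ToolRunner has already populated from the command line
        Job job = Job.getInstance(getConf());
        job.setJarByClass(WordCountToolDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args here are the leftover arguments after the generic options
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new WordCountToolDriver(), args));
    }
}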
To keep the cluster from missing required dependencies, a plugin is generally added so that the dependencies are packaged together with the code:
<plugins>
    <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.2</version>
        <configuration>
            <source>1.8</source>
            <target>1.8</target>
        </configuration>
    </plugin>
    <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
            <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
        </configuration>
        <executions>
            <execution>
                <id>make-assembly</id>
                <phase>package</phase>
                <goals>
                    <goal>single</goal>
                </goals>
            </execution>
        </executions>
    </plugin>
</plugins>
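With this configuration in place, the normal Maven package phase builds both a plain jar and a *-jar-with-dependencies.jar (the exact file names depend on the project's artifactId and version):
mvn clean package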
Once the jar file has been uploaded, the job can be run with:
hadoop jar <jar file> <fully qualified WordCountDriver class name> /<input path> /<output path>
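For example, assuming the jar is named wc.jar and the Driver sits in a package such as com.atguigu.mapreduce.wordcount (both names are illustrative), with HDFS input under /input and output to /output:
hadoop jar wc.jar com.atguigu.mapreduce.wordcount.WordCountDriver /input /output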