MapReduce
1. MapReduce principles
(analyze first, then code)
1) WordCount
2) Scheduling on the Yarn platform
3) Data flow of the WordCount word-count example
4) Developing your own WordCount program
2. MapReduce programming
1) Three ways to run an MR job
- run locally on Windows
- package as a jar, upload to Linux, and run with: hadoop jar xxx
- package as a jar on Windows, run locally, and submit to Yarn on Linux
1. Environment setup
Environment: Java + Eclipse
Required jar packages:
/root/training/hadoop-2.7.3/share/hadoop/common
/root/training/hadoop-2.7.3/share/hadoop/common/lib
/root/training/hadoop-2.7.3/share/hadoop/mapreduce
/root/training/hadoop-2.7.3/share/hadoop/mapreduce/lib
2. Create the classes
Create the Mapper class WordCountMapper
Create the Reducer class WordCountReducer
Create the main class WordCountMain
3. Write the Mapper class WordCountMapper
1) Extend the Mapper class
public class WordCountMapper extends Mapper{
}
2) Specify the input/output types (k1,v1)(k2,v2)
k1: LongWritable
v1: Text
k2: Text
v2: LongWritable
public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable>{
}
3) Override the map method
In the editor: right-click --> Source --> Override/Implement Methods --> select map
4) Clean up the format and rename the parameters
protected void map(LongWritable k1, Text v1, Context context)
        throws IOException, InterruptedException {
}
5) Tokenize the line and write to the context
String data = v1.toString();
// split on spaces
String[] words = data.split(" ");
for(String w : words){
    context.write(new Text(w), new LongWritable(1));
}
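Stripped of the Hadoop types, the map step above just turns one line of text into a list of (word, 1) pairs. A minimal plain-Java sketch of that behavior (`MapSketch` and its pair representation are illustrative, not part of the Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;

public class MapSketch {
    // Emits one (word, "1") pair per token, mirroring the map logic above.
    static List<String[]> map(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String w : line.split(" ")) {
            pairs.add(new String[]{w, "1"});
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] p : map("I love Beijing")) {
            System.out.println(p[0] + "\t" + p[1]);
        }
    }
}
```

In the real job, each pair is instead written as (Text, LongWritable) via context.write, and the framework, not the mapper, collects the pairs.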
4. Write the Reducer class WordCountReducer
1) Extend the Reducer class
public class WordCountReducer extends Reducer{
}
2) Specify the input/output types (k3,v3)(k4,v4)
k3: Text
v3: LongWritable
k4: Text
v4: LongWritable
public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable>{
}
3) Override the reduce method
In the editor: right-click --> Source --> Override/Implement Methods --> select reduce
4) Clean up the format and rename the parameters
protected void reduce(Text k3, Iterable<LongWritable> v3, Context context)
        throws IOException, InterruptedException {
}
5) Sum the values and write to the context
long total = 0;
for(LongWritable v : v3){
    total = total + v.get();
}
context.write(k3, new LongWritable(total));
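After the shuffle, each key reaches the reducer together with all of its values, and reduce just sums them. A plain-Java sketch of the whole map + shuffle + reduce flow for word counting (`ReduceSketch` is illustrative; a real job distributes this across tasks):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReduceSketch {
    // Simulates map (split), shuffle (group by word), and reduce (sum) in memory.
    static Map<String, Long> wordCount(String[] lines) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String line : lines) {
            for (String w : line.split(" ")) {
                totals.merge(w, 1L, Long::sum);   // total = total + v.get()
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> totals = wordCount(new String[]{
                "I love Beijing", "I love China"});
        totals.forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

The grouping done here by the map's merge call is exactly what the shuffle phase does between the Mapper and the Reducer.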
5. Write the main program WordCountMain
1) Create a job
// job = Mapper + Reducer
Job job = Job.getInstance(new Configuration());
2) Specify the job's entry point
job.setJarByClass(WordCountMain.class);
3) Specify the mapper and its output (k2,v2) data types
job.setMapperClass(WordCountMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
4) Specify the reducer and its output (k4,v4) data types
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
5) Specify the input and output paths
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
6) Run the job
job.waitForCompletion(true);
6. Build the jar
Right-click the package --> Export --> JAR file --> choose the save path for the jar (here named s1) --> select the Main class --> Finish
7. Run the jar
hadoop jar s1.jar /input/data.txt /output/w0919
MapReduce advanced features
1. Serialization
2. Sorting
3. Partitioning
4. Combining
1. Serialization
Java serialization:
Core: implement the Serializable interface
If a class implements the Java serialization interface (Serializable), its objects can be written to an OutputStream and read back from an InputStream (Serializable declares no methods -- it is a marker interface)
I/O: serialization: memory to disk; deserialization: disk to memory
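A minimal round-trip sketch of Java serialization, using a byte array in place of a disk file (`Student` is a hypothetical example class, not from the source):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationDemo {
    // Hypothetical example class: implementing the marker interface
    // Serializable is all that is required.
    static class Student implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        int score;
        Student(String name, int score) { this.name = name; this.score = score; }
    }

    // Serialization: object (memory) -> bytes (the "disk" side).
    static byte[] serialize(Student s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(s);
        }
        return bos.toByteArray();
    }

    // Deserialization: bytes -> object (back into memory).
    static Student deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Student) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Student copy = deserialize(serialize(new Student("Tom", 80)));
        System.out.println(copy.name + " " + copy.score);
    }
}
```

Note that Hadoop does not use Java's Serializable for its keys and values; it uses its own, more compact Writable mechanism, which is what the serialization feature above refers to.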