### The MapReduce Programming Model
- A distributed computing model that solves the problem of computing over massive data sets.
- MapReduce abstracts the entire parallel computation into two functions:
- Map: applies a specified operation to each element of a list of independent elements; highly parallelizable.
- Reduce: merges the elements of a list.
- A simple MapReduce program only needs to specify map(), reduce(), the input, and the output; the framework handles the rest.
- MapReduce divides a job's entire execution into two phases: a Map phase and a Reduce phase.
- The Map phase consists of some number of Map Tasks:
  - Input format parsing: InputFormat
  - Input record processing: Mapper
  - Data partitioning: Partitioner
- The Reduce phase consists of some number of Reduce Tasks:
  - Remote copying of map output (shuffle)
  - Sorting data by key
  - Data processing: Reducer
  - Output formatting: OutputFormat
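The partitioning step decides which Reduce Task receives each key. A minimal sketch of the default hash-partitioning logic, assuming the same formula Hadoop's HashPartitioner uses; the class name PartitionDemo is illustrative, and String.hashCode() here stands in for hashing the serialized key bytes:

```java
public class PartitionDemo {
    // Non-negative hash of the key, modulo the number of Reduce Tasks --
    // the formula used by Hadoop's default HashPartitioner.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every occurrence of the same word lands on the same reducer,
        // which is what makes per-key aggregation in reduce() possible.
        for (String word : new String[]{"hadoop", "yarn", "mapreduce", "hadoop"}) {
            System.out.println(word + " -> reducer " + getPartition(word, 3));
        }
    }
}
```

Because partitioning is deterministic, all values for one key are guaranteed to be grouped on a single Reduce Task.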
### Writing a MapReduce Program
- Writing distributed parallel programs on the MapReduce model is quite simple: the programmer's main coding work is implementing the Map and Reduce functions.
- The other thorny problems of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance, and network communication, are all handled by the YARN framework.
input -> map -> reduce -> output

The format of data passed between stages is the <key,value> pair:

input:  <key,value>
map
output: <key,value>
--------------------
input:  <key,value>
reduce
output: <key,value>

With the default TextInputFormat, the key is the starting byte offset of each line and the value is the line's contents:

hadoop yarn      -> <0,hadoop yarn>
hadoop mapreduce -> <12,hadoop mapreduce>
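The line offsets can be reproduced with plain Java. OffsetDemo is an illustrative class of mine, assuming single-byte characters and '\n' line endings, which is how the starting byte offset of each input line is computed:

```java
public class OffsetDemo {
    // Starting byte offset of each line, assuming single-byte characters
    // and '\n' line terminators.
    static long[] offsets(String[] lines) {
        long[] result = new long[lines.length];
        long offset = 0;
        for (int i = 0; i < lines.length; i++) {
            result[i] = offset;
            offset += lines[i].length() + 1; // +1 for the trailing '\n'
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {"hadoop yarn", "hadoop mapreduce"};
        long[] offs = offsets(lines);
        for (int i = 0; i < lines.length; i++) {
            System.out.println("<" + offs[i] + "," + lines[i] + ">");
        }
    }
}
```

"hadoop yarn" is 11 characters, so with its newline the second line begins at byte 12.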
### The MapReduce Boilerplate
- In MapReduce, the map and reduce functions follow these general forms:
map:    (K1,V1)       --> list(K2,V2)
reduce: (K2,list(V2)) --> list(K3,V3)
- The Mapper base class method:
protected void map(KEY key, VALUE value, Context context)
throws IOException, InterruptedException {
}
- The Reducer base class method:
protected void reduce(KEY key, Iterable<VALUE> values, Context context)
throws IOException, InterruptedException {
}
- Context is the context object, through which map and reduce emit their output.
A MapReduce program: word count
WordCount.java
package xiankun_qin.xiangkun;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
//step 1 : Map Class
/**
 * Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 * @author xiangkun
 */
public static class WordCountMapper extends Mapper<LongWritable,Text,Text,IntWritable>{
private Text mapOutPutKey = new Text();
private final static IntWritable mapOutputValue = new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// do not call super.map(): the default implementation would write the
// raw <offset, line> pair straight through to the output
//line value
String lineValue =value.toString();
//split
StringTokenizer stringTokenizer = new StringTokenizer(lineValue);
//iterator
while(stringTokenizer.hasMoreTokens()){
//get word value
String wordValue = stringTokenizer.nextToken();
//set value
mapOutPutKey.set(wordValue);
//output
context.write(mapOutPutKey, mapOutputValue);
}
}
}
//step 2: Reduce Class
/**
* Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
* @author xiangkun
*
*/
public static class WordCountReducer extends Reducer<Text,IntWritable,Text,IntWritable>{
private IntWritable outputValue = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
// do not call super.reduce(): the default implementation would write each
// raw <key, value> pair straight through to the output
//sum tmp
int sum =0;
for(IntWritable value:values){
//total
sum += value.get();
}
//setvalue
outputValue.set(sum);
//output
context.write(key, outputValue);
}
}
//step 3: Driver, assemble and submit the job
public int run(String[] args) throws Exception{
//1.get configuration
Configuration configuration = new Configuration();
//2.create job
Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
//run jar
job.setJarByClass(this.getClass());
//3.set job
//input -->map -->reduce -->output
//3.1 input
Path inPath = new Path(args[0]);
FileInputFormat.addInputPath(job, inPath);
//3.2 map
job.setMapperClass(WordCountMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
//3.3: reduce
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//3.4 output
Path outPath= new Path(args[1]);
FileOutputFormat.setOutputPath(job, outPath);
//4.submit job
boolean isSuccess = job.waitForCompletion(true);
return isSuccess ? 0 : 1;
}
//step 4 :run program
public static void main(String[] args) throws Exception {
int status = new WordCount().run(args);
System.exit(status);
}
}
Running a MapReduce program
- Package the whole project into a jar
- xiangkun@xiangkun-qin:/opt/modules/hadoop-2.5.0$ bin/yarn jar jars/my_wordcount.jar /user/xiangkun/mapreduce/wordcount/input /user/xiangkun/mapreduce/wordcount/output0
### Data Types
- All of these data types implement the Writable interface, so that data defined with them can be serialized for network transfer and file storage.
- Basic data types:
  - BooleanWritable: standard boolean value
  - ByteWritable: single-byte value
  - DoubleWritable: double-precision floating-point value
  - FloatWritable: single-precision floating-point value
  - IntWritable: integer value
  - LongWritable: long integer value
  - Text: text stored in UTF-8 format
  - NullWritable: used when the key or value in a (key,value) pair is empty
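What the Writable types do under the hood can be sketched with plain java.io. WritableStyleDemo is my own illustration: writeInt mirrors what IntWritable writes, while writeUTF is only an analogy for Text, whose actual wire format uses a VInt length prefix before the UTF-8 bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WritableStyleDemo {
    // Serialize an int and a string to bytes, then read them back:
    // the same write(DataOutput)/readFields(DataInput) round trip a
    // Writable performs for network transfer and file storage.
    static Object[] roundTrip(int number, String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(number); // what IntWritable.write() boils down to
            out.writeUTF(text);   // length-prefixed string (Text uses a VInt prefix)
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
            return new Object[]{in.readInt(), in.readUTF()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Object[] back = roundTrip(42, "hadoop");
        System.out.println(back[0] + " " + back[1]); // 42 hadoop
    }
}
```

The round trip recovers the original values exactly, which is the contract every Writable must satisfy.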
- Writable
  - write() serializes an object to the output stream
  - readFields() deserializes an object from the input stream's bytes
- WritableComparable
  - extends Writable with comparison, so these types can be used as keys and sorted
  - comparing Java value objects: override the toString(), hashCode(), and equals() methods
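A sketch of a custom composite key following the WritableComparable pattern. WordPair is a hypothetical class of mine; in a real job it would declare `implements WritableComparable<WordPair>` from org.apache.hadoop.io, but the sketch keeps only java.io so it compiles stand-alone:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class WordPair implements Comparable<WordPair> {
    private String first = "";
    private String second = "";

    public WordPair() {} // Writables need a no-arg constructor for reflection

    public WordPair(String first, String second) {
        this.first = first;
        this.second = second;
    }

    // Serialize the fields, in order, for network transfer and storage.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(first);
        out.writeUTF(second);
    }

    // Deserialize the fields in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        first = in.readUTF();
        second = in.readUTF();
    }

    // Defines the sort order of keys before they reach reduce().
    @Override
    public int compareTo(WordPair other) {
        int cmp = first.compareTo(other.first);
        return cmp != 0 ? cmp : second.compareTo(other.second);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WordPair)) return false;
        WordPair p = (WordPair) o;
        return first.equals(p.first) && second.equals(p.second);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 31 + second.hashCode();
    }

    @Override
    public String toString() {
        return first + "\t" + second;
    }
}
```

Overriding equals() and hashCode() consistently with compareTo() is what makes the type safe to use as a key.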