Distributed Processing Framework: MapReduce
MapReduce Overview
Originated from Google's MapReduce paper, published in December 2004
Hadoop MapReduce is an open-source clone of Google's MapReduce
Advantages of MapReduce: offline processing of massive data sets; easy to develop and easy to run
Disadvantages of MapReduce: not suited to real-time or streaming computation
The MapReduce Programming Model
Example: WordCount
Count the number of occurrences of each word in a file
Requirement: compute WordCount on a very large file — how do we handle counting and analysis at this data volume?
Solution: use a distributed computing framework, such as MapReduce
WordCount execution flow
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
Map and Reduce phases
A job is split into a Map phase and a Reduce phase
Map phase: Map Tasks
Reduce phase: Reduce Tasks
Execution steps
Prepare the input data for map processing
Mapper processing
Shuffle
Reduce processing
Output the results
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
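As an illustration, the WordCount flow above can be simulated in plain Java. This is an in-memory sketch of map -> shuffle/sort -> reduce, not actual Hadoop API code; the class and method names here are made up for this example:

```java
import java.util.*;

// In-memory simulation of the WordCount data flow:
// (input) <k1,v1> -> map -> <k2,v2> -> shuffle/sort -> reduce -> <k3,v3>
public class WordCountSim {

    // Map phase: split each input line into words, emitting <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle: group map output values by key (a TreeMap also sorts the keys,
    // mimicking the framework's sort of map outputs).
    // Reduce phase: sum the grouped counts for each word.
    public static SortedMap<String, Integer> wordCount(List<String> lines) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("hello world", "hello mapreduce")));
    }
}
```

In real Hadoop the map and reduce steps run as separate distributed tasks and the shuffle moves data across the network; this sketch only mirrors the logical data flow.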
Classes and interfaces involved:
Writable
Any key or value type in the Hadoop Map-Reduce framework implements this interface.
It defines two methods:
/**
* Serialize the fields of this object to <code>out</code>.
*/
void write(DataOutput out) throws IOException;
/**
* Deserialize the fields of this object from <code>in</code>.
*/
void readFields(DataInput in) throws IOException;
demo
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class MyWritable implements Writable {
    // Some data
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    // Static factory: create an instance and populate it from the stream
    public static MyWritable read(DataInput in) throws IOException {
        MyWritable w = new MyWritable();
        w.readFields(in);
        return w;
    }
}
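The write/readFields pair can be exercised with plain java.io streams. The sketch below is self-contained: it declares a minimal local Writable stand-in so it compiles without Hadoop on the classpath (the real interface is org.apache.hadoop.io.Writable), and the class name WritableRoundTrip is made up for this example:

```java
import java.io.*;

// Minimal local stand-in for org.apache.hadoop.io.Writable,
// so this sketch compiles without Hadoop on the classpath.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

public class WritableRoundTrip implements Writable {
    int counter;
    long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    // Serialize to a byte array and deserialize into a fresh instance:
    // the fields must survive the round trip unchanged.
    public static WritableRoundTrip roundTrip(WritableRoundTrip src) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            src.write(new DataOutputStream(buf));
            WritableRoundTrip dst = new WritableRoundTrip();
            dst.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return dst;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WritableRoundTrip w = new WritableRoundTrip();
        w.counter = 42;
        w.timestamp = 1700000000L;
        WritableRoundTrip copy = roundTrip(w);
        System.out.println(copy.counter + " " + copy.timestamp);
    }
}
```

This byte-level round trip is essentially what the framework does when it serializes keys and values between map and reduce tasks.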
WritableComparable
Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
Note that hashCode() is frequently used in Hadoop to partition keys. It's important that your implementation of hashCode() returns the same result across different instances of the JVM. Note also that the default hashCode() implementation in Object does not satisfy this property.
demo
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
    // Some data
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    public int compareTo(MyWritableComparable o) {
        int thisValue = this.counter;
        int thatValue = o.counter;
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }

    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + counter;
        result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
        return result;
    }
}
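During the shuffle, map output keys are sorted with compareTo before they reach the reducers. The self-contained sketch below uses java.lang.Comparable as a stand-in (the real WritableComparable additionally extends Writable); the class name MyKey is made up for this example. Note the hashCode depends only on field values, so it is stable across JVM instances, as the partitioning requirement above demands:

```java
import java.util.*;

// Stand-in key class: the framework sorts map output keys with compareTo
// before handing them to the reduce tasks.
public class MyKey implements Comparable<MyKey> {
    final int counter;
    final long timestamp;

    MyKey(int counter, long timestamp) {
        this.counter = counter;
        this.timestamp = timestamp;
    }

    public int compareTo(MyKey o) {
        return Integer.compare(this.counter, o.counter);
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + counter;
        result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
        return result; // deterministic: depends only on field values
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof MyKey)) return false;
        MyKey other = (MyKey) obj;
        return counter == other.counter && timestamp == other.timestamp;
    }

    public static void main(String[] args) {
        List<MyKey> keys = new ArrayList<>(Arrays.asList(
                new MyKey(3, 0L), new MyKey(1, 0L), new MyKey(2, 0L)));
        Collections.sort(keys); // sorted by counter, as compareTo dictates
        for (MyKey k : keys) System.out.println(k.counter);
    }
}
```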
Core MapReduce Concepts
Split: the chunk of data handed to a MapReduce job for processing; the smallest unit of computation in MapReduce. (By contrast, the HDFS block, 128 MB by default, is the smallest unit of storage in HDFS.) By default splits and blocks correspond one-to-one — e.g. a 300 MB file stored as 128 MB + 128 MB + 44 MB blocks yields three splits, and hence three map tasks — but the relationship can also be configured manually.
InputFormat: splits the input data into InputSplits and provides a RecordReader that turns each split into key/value records (e.g. TextInputFormat reads a text file line by line)
OutputFormat: writes the job's output records to the file system (e.g. TextOutputFormat writes one key/value pair per line)
Combiner: an optional local "mini-reduce" that runs on each map task's output before the shuffle, reducing the volume of data transferred to the reducers
Partitioner: determines which reduce task receives each intermediate key; the default HashPartitioner routes by the key's hashCode
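The default partitioning rule can be sketched in plain Java. This mirrors the logic of Hadoop's HashPartitioner (mask the key's hashCode to non-negative, then take it modulo the number of reduce tasks); the class name here is made up for the sketch:

```java
// Sketch of Hadoop's default hash-partitioning rule. All records that share
// a key produce the same partition number and so land on the same reducer —
// which is why hashCode() must be deterministic across JVMs.
public class HashPartitionerSketch {
    public static int getPartition(Object key, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then
        // spread keys across the reduce tasks by modulo.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int p = getPartition("hello", 4);
        System.out.println("partition for \"hello\" with 4 reducers: " + p);
        // The same key always maps to the same partition:
        System.out.println(p == getPartition("hello", 4));
    }
}
```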