05 Distributed Processing Framework: MapReduce

MapReduce Overview

MapReduce originated from Google's MapReduce paper, published in December 2004.
Hadoop MapReduce is an open-source clone of Google's MapReduce.
Strengths of MapReduce: offline processing of massive data sets; easy to develop and easy to run.
Weakness of MapReduce: not suited to real-time or streaming computation.

The MapReduce Programming Model

Example: wordcount

Count the number of occurrences of each word in a file.
Problem: compute a wordcount over a file that is very large. How do we handle statistical analysis at this data volume?
Solution: use a distributed computing framework, such as MapReduce.
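Before looking at the distributed version, the core idea can be sketched as a local, single-process simulation. This is a minimal illustration of the programming model only (class and method names are ours, not part of Hadoop); the real framework runs the same map/group/reduce steps across many machines.

```java
import java.util.*;

// A local, single-process sketch of wordcount: "map" emits (word, 1) pairs,
// "shuffle" groups them by key, and "reduce" sums the counts per key.
public class LocalWordCount {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // map + shuffle: collect a list of 1s under each word
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
                }
            }
        }
        // reduce: sum the 1s for each word
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int total = 0;
            for (int v : e.getValue()) total += v;
            counts.put(e.getKey(), total);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                wordCount(Arrays.asList("hello world", "hello mapreduce"));
        System.out.println(counts.get("hello")); // 2
        System.out.println(counts.get("world")); // 1
    }
}
```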

Wordcount execution flow

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

Map and Reduce Phases

A job is split into a Map phase and a Reduce phase.
Map phase: Map Tasks
Reduce phase: Reduce Tasks

Execution Steps

Prepare the input data for map processing
Mapper processing
Shuffle
Reducer processing
Output the results
Input and Output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
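The combine stage in the flow above is optional: it applies the reduce logic locally to one map task's output before the shuffle. A small sketch (the class name is ours, for illustration) shows why this is safe for wordcount: summation is associative, so combining locally and then reducing gives the same result as reducing directly.

```java
import java.util.*;

// For wordcount, combine and reduce are the same operation: sum the values
// for one key. Combining on the map side shrinks the data shuffled over the
// network without changing the final counts.
public class CombineDemo {
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    public static void main(String[] args) {
        // map output for the key "hello" from one map task: four (hello, 1) pairs
        List<Integer> mapOutput = Arrays.asList(1, 1, 1, 1);

        // path 1: reduce directly over the raw map output
        int direct = sum(mapOutput);

        // path 2: combine locally first, then reduce the single combined value
        int combined = sum(Arrays.asList(sum(mapOutput)));

        System.out.println(direct);   // 4
        System.out.println(combined); // 4
    }
}
```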


Several classes and interfaces are involved:
Writable

Any key or value type in the Hadoop Map-Reduce framework implements this interface
It has two methods:

/**
 * Serialize the fields of this object to <code>out</code>.
 */
void write(DataOutput out) throws IOException;
/**
 * Deserialize the fields of this object from <code>in</code>. 
 */  
void readFields(DataInput in) throws IOException;

Demo:

public class MyWritable implements Writable {
    // Some data
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    public static MyWritable read(DataInput in) throws IOException {
        MyWritable w = new MyWritable();
        w.readFields(in);
        return w;
    }
}
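The write/readFields pair is just manual serialization over DataOutput/DataInput. A dependency-free round trip can be sketched with only the JDK (dropping the Hadoop Writable interface so it runs without Hadoop on the classpath; the class names here are ours):

```java
import java.io.*;

// Round-tripping the Writable pattern with plain java.io streams:
// DataOutputStream/DataInputStream stand in for the framework-provided
// DataOutput/DataInput.
public class MyWritableDemo {
    static class MyWritable {
        int counter;
        long timestamp;

        void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
        }

        void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        MyWritable w = new MyWritable();
        w.counter = 42;
        w.timestamp = 1700000000L;

        // serialize to bytes, as the framework does between map and reduce
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        w.write(new DataOutputStream(bytes));

        // deserialize into a fresh instance
        MyWritable copy = new MyWritable();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(copy.counter);   // 42
        System.out.println(copy.timestamp); // 1700000000
    }
}
```

Note that readFields must consume the fields in exactly the order write emitted them; the byte stream carries no field names.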
WritableComparable

Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
Note that hashCode() is frequently used in Hadoop to partition keys. It’s important that your implementation of hashCode() returns the same result across different instances of the JVM. Note also that the default hashCode() implementation in Object does not satisfy this property.
Demo:

public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
    // Some data
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    public int compareTo(MyWritableComparable o) {
        // order keys by counter
        int thisValue = this.counter;
        int thatValue = o.counter;
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }

    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + counter;
        result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
        return result;
    }
}
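The reason keys need compareTo is that the framework sorts map output by key before feeding it to the reducers. A dependency-free sketch (class names ours) shows a minimal key type with the same fields being sorted the way the shuffle would sort it:

```java
import java.util.*;

// A minimal key type ordered by counter, as in the compareTo above.
// Sorting a list of keys mimics what the shuffle does to map output.
public class KeySortDemo {
    static class MyKey implements Comparable<MyKey> {
        int counter;
        long timestamp;

        MyKey(int counter, long timestamp) {
            this.counter = counter;
            this.timestamp = timestamp;
        }

        public int compareTo(MyKey o) {
            return Integer.compare(this.counter, o.counter);
        }
    }

    public static void main(String[] args) {
        List<MyKey> keys = new ArrayList<>(Arrays.asList(
                new MyKey(3, 30L), new MyKey(1, 10L), new MyKey(2, 20L)));
        Collections.sort(keys); // the shuffle sorts map output keys like this
        for (MyKey k : keys) System.out.println(k.counter); // 1, 2, 3
    }
}
```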
MapReduce Core Concepts

Split: the chunk of data handed to a MapReduce job; it is the smallest unit of computation in MapReduce, just as the HDFS block size (128 MB) is the smallest unit of storage in HDFS. By default the two correspond one-to-one, though the relationship can also be configured manually
InputFormat: divides the input data into splits and turns them into records for the map tasks
OutputFormat: writes the job's output
Combiner: an optional "local reduce" applied to map output on the map side to cut down the data shuffled across the network
Partitioner: determines which reduce task receives each map output key
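The default partitioning logic can be sketched in a few lines (the class here is ours; Hadoop's own HashPartitioner uses the same expression, masking the sign bit of hashCode so the result is never negative):

```java
// Default hash partitioning: every occurrence of a key maps to the same
// reduce task, which is what makes per-key aggregation possible.
public class PartitionDemo {
    static int partition(String key, int numReduceTasks) {
        // mask the sign bit so the modulo result is always in [0, numReduceTasks)
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partition("hello", 3) == partition("hello", 3)); // true
        System.out.println(partition("hello", 3) >= 0);                     // true
    }
}
```

This is also why a key type's hashCode must be stable across JVM instances, as noted above: map tasks on different machines must route the same key to the same reducer.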

MapReduce Architecture

MapReduce Programming
