Distributed Processing Framework: MapReduce
MapReduce Overview
Originated from Google's MapReduce paper, published in December 2004
Hadoop MapReduce is an open-source clone of Google's MapReduce
Advantages of MapReduce: offline processing of massive data sets; easy to develop and easy to run
Disadvantages of MapReduce: not suited to real-time or streaming computation
The MapReduce Programming Model
Example: WordCount
Count the number of occurrences of each word in a file
Requirement: compute WordCount on a very large file — how do we handle counting and analysis at this data volume?
Solution: use a distributed computing framework, such as MapReduce
WordCount execution flow
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
Map and Reduce phases
A job is split into a Map phase and a Reduce phase
Map phase: Map Tasks
Reduce phase: Reduce Tasks
Execution steps
Prepare the input data for map processing
Mapper processing
Shuffle
Reduce processing
Output the results
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
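As an illustration, the WordCount flow above can be simulated in plain Java. This is an in-memory sketch of map -> shuffle/sort -> reduce, not actual Hadoop API code; the class and method names here are made up for this example:

```java
import java.util.*;

// In-memory simulation of the WordCount data flow:
// (input) <k1,v1> -> map -> <k2,v2> -> shuffle/sort -> reduce -> <k3,v3>
public class WordCountSim {

    // Map phase: split each input line into words, emitting <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle: group map output values by key (a TreeMap also sorts the keys,
    // mimicking the framework's sort of map outputs).
    // Reduce phase: sum the grouped counts for each word.
    public static SortedMap<String, Integer> wordCount(List<String> lines) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("hello world", "hello mapreduce")));
    }
}
```

In real Hadoop the map and reduce steps run as separate distributed tasks and the shuffle moves data across the network; this sketch only mirrors the logical data flow.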
Classes and interfaces involved:
Writable
Any key or value type in the Hadoop Map-Reduce framework implements this interface.
It defines two methods:
/**
* Serialize the fields of this object to <code>out</code>.
*/
void write(DataOutput out) throws IOException;
/**
* Deserialize the fields of this object from <code>in</code>.
*/
void readFields(DataInput in) throws IOException;
demo
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class MyWritable implements Writable {
    // Some data
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    // Static factory: create an instance and populate it from the stream
    public static MyWritable read(DataInput in) throws IOException {
        MyWritable w = new MyWritable();
        w.readFields(in);
        return w;
    }
}
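The write/readFields pair can be exercised with plain java.io streams. The sketch below is self-contained: it declares a minimal local Writable stand-in so it compiles without Hadoop on the classpath (the real interface is org.apache.hadoop.io.Writable), and the class name WritableRoundTrip is made up for this example:

```java
import java.io.*;

// Minimal local stand-in for org.apache.hadoop.io.Writable,
// so this sketch compiles without Hadoop on the classpath.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

public class WritableRoundTrip implements Writable {
    int counter;
    long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    // Serialize to a byte array and deserialize into a fresh instance:
    // the fields must survive the round trip unchanged.
    public static WritableRoundTrip roundTrip(WritableRoundTrip src) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            src.write(new DataOutputStream(buf));
            WritableRoundTrip dst = new WritableRoundTrip();
            dst.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return dst;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WritableRoundTrip w = new WritableRoundTrip();
        w.counter = 42;
        w.timestamp = 1700000000L;
        WritableRoundTrip copy = roundTrip(w);
        System.out.println(copy.counter + " " + copy.timestamp);
    }
}
```

This byte-level round trip is essentially what the framework does when it serializes keys and values between map and reduce tasks.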
WritableComparable
Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
Note that hashCode() is frequently used in Hadoop to partition keys. It's important that your implementation of hashCode() returns the same result across different instances of the JVM. Note also that the default hashCode() implementation in Object does not satisfy this property.
demo
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
    // Some data
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    public int compareTo(MyWritableComparable o) {
        int thisValue = this.counter;
        int thatValue = o.counter;
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }

    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + counter;
        result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
        return result;
    }
}
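During the shuffle, map output keys are sorted with compareTo before they reach the reducers. The self-contained sketch below uses java.lang.Comparable as a stand-in (the real WritableComparable additionally extends Writable); the class name MyKey is made up for this example. Note the hashCode depends only on field values, so it is stable across JVM instances, as the partitioning requirement above demands:

```java
import java.util.*;

// Stand-in key class: the framework sorts map output keys with compareTo
// before handing them to the reduce tasks.
public class MyKey implements Comparable<MyKey> {
    final int counter;
    final long timestamp;

    MyKey(int counter, long timestamp) {
        this.counter = counter;
        this.timestamp = timestamp;
    }

    public int compareTo(MyKey o) {
        return Integer.compare(this.counter, o.counter);
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + counter;
        result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
        return result; // deterministic: depends only on field values
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof MyKey)) return false;
        MyKey other = (MyKey) obj;
        return counter == other.counter && timestamp == other.timestamp;
    }

    public static void main(String[] args) {
        List<MyKey> keys = new ArrayList<>(Arrays.asList(
                new MyKey(3, 0L), new MyKey(1, 0L), new MyKey(2, 0L)));
        Collections.sort(keys); // sorted by counter, as compareTo dictates
        for (MyKey k : keys) System.out.println(k.counter);
    }
}
```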
Core MapReduce Concepts
Split: the chunk of data handed to a MapReduce job for processing; the smallest unit of computation in MapReduce. (By contrast, the HDFS block, 128 MB by default, is the smallest unit of storage in HDFS.) By default splits and blocks correspond one-to-one — e.g. a 300 MB file stored as 128 MB + 128 MB + 44 MB blocks yields three splits, and hence three map tasks — but the relationship can also be configured manually.
InputFormat: splits the input data into InputSplits and provides a RecordReader that turns each split into key/value records (e.g. TextInputFormat reads a text file line by line)
OutputFormat: writes the job's output records to the file system (e.g. TextOutputFormat writes one key/value pair per line)
Combiner: an optional local "mini-reduce" that runs on each map task's output before the shuffle, reducing the volume of data transferred to the reducers
Partitioner: determines which reduce task receives each intermediate key; the default HashPartitioner routes by the key's hashCode
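The default partitioning rule can be sketched in plain Java. This mirrors the logic of Hadoop's HashPartitioner (mask the key's hashCode to non-negative, then take it modulo the number of reduce tasks); the class name here is made up for the sketch:

```java
// Sketch of Hadoop's default hash-partitioning rule. All records that share
// a key produce the same partition number and so land on the same reducer —
// which is why hashCode() must be deterministic across JVMs.
public class HashPartitionerSketch {
    public static int getPartition(Object key, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then
        // spread keys across the reduce tasks by modulo.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int p = getPartition("hello", 4);
        System.out.println("partition for \"hello\" with 4 reducers: " + p);
        // The same key always maps to the same partition:
        System.out.println(p == getPartition("hello", 4));
    }
}
```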