MapReduce Architecture

wealon

Article link: https://blog.csdn.net/wealon/article/details/41924517

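★ How MapReduce Works

MapReduce is a distributed computing model for processing large data sets.

A MapReduce job consists of two phases, Map and Reduce. The user only needs to implement two functions, map() and reduce().

★ MapReduce Execution Flow

A job runs two kinds of tasks: Map tasks and Reduce tasks.

▲ Map task steps:

M1. Read the input file and parse its content into key-value pairs. The parsing unit is one line: each line becomes one key-value pair, and map() is called once for every pair.

M2. In map(), process the input key-value pair, convert it into new key-value pairs, and emit them.

M3. Partition the emitted key-value pairs, that is, decide how many output branches the map side has. (Partition)

M4. Within each partition, sort and group the data by key. The values of identical keys are collected into one set.

M5. Optionally reduce the grouped data locally. (Combiner)

▲ Reduce task steps:

R1. Copy the output of the map tasks over the network to the reduce nodes, partition by partition.

R2. Merge and sort the outputs of the map tasks, then process the incoming key-value pairs and emit new key-value pairs.

R3. Save the reduce output to files.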
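
The steps above can be traced in a small plain-Java sketch. It is only a single-process simulation, not Hadoop code: the class name MiniMapReduce and its methods are made up for illustration, the optional Combiner step (M5) is left out, and a TreeMap stands in for the sort and group work that the framework does during the shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process sketch of steps M1-M4 and R1-R3; no Hadoop classes involved.
class MiniMapReduce {

    // M1 + M2: one line is one record; map() emits (word, 1) for every word.
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1L));
            }
        }
        return out;
    }

    // M3: choose the partition (reduce task) for a key.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // R2: reduce() sees one key together with all of its values.
    static long reduce(String key, List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    static List<TreeMap<String, Long>> run(List<String> lines, int numReduceTasks) {
        // One sorted map per partition: the TreeMap sorts and groups by key (M4, R1, R2).
        List<TreeMap<String, List<Long>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new TreeMap<>());
        }
        for (String line : lines) {
            for (Map.Entry<String, Long> kv : map(line)) {
                partitions.get(partition(kv.getKey(), numReduceTasks))
                        .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
        }
        // R3: one output "file" per reduce task.
        List<TreeMap<String, Long>> outputs = new ArrayList<>();
        for (TreeMap<String, List<Long>> p : partitions) {
            TreeMap<String, Long> out = new TreeMap<>();
            p.forEach((k, vs) -> out.put(k, reduce(k, vs)));
            outputs.add(out);
        }
        return outputs;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hello world", "hello hadoop"), 1));
    }
}
```

With a single reduce task every key lands in partition 0, so the whole result comes back as one sorted map.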

★ A First MR Example

Modeled on the wordcount program from the official demos, this example counts how many times each word occurs in one or more files.

package com.broader.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

/**
 * @ClassName: FirstMR
 * @Description: A first MapReduce program; counts the occurrences of each word in a file
 * @author wealon wealondatou@gmail.com
 * @date 2014-7-17 12:58:05 PM
 */
public class FirstMR {

    private static final String INPUT_PATH = "hdfs://hadooptest:9000/wealon";
    private static final String OUTPUT_PATH = "hdfs://hadooptest:9000/out4";

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "FirstMR");

        // 1. Set the location of the input files.
        // Note: setInputPaths replaces the whole list of input paths;
        // FileInputFormat.addInputPath(job, new Path(...)) appends a single path instead.
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        // Specify how the input files are parsed into records
        job.setInputFormatClass(TextInputFormat.class);

        // 2. Map settings: the custom Mapper class and its output types
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // 3. Partitioning
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(1);

        // 4. Sorting and grouping
        // 5. Combining

        // Reduce settings: 1. the custom Reducer class and its output types
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // 2. Set the output path
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        job.setOutputFormatClass(TextOutputFormat.class);

        // Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }

    /**
     * @ClassName: MyMapper
     * @Description: Processes every line of the text input
     * KEYIN: byte offset of the input line
     * VALUEIN: content of the input line
     * KEYOUT: a single word
     * VALUEOUT: the count emitted for that word, always 1
     */
    static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        /**
         * key: offset of the line
         * value: text of the line
         * context: the context object
         */
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            // Each call handles one line of the text
            String line = value.toString();
            System.out.println("key:" + key + "=======line:" + line);
            // Words are expected to be separated by tabs
            String[] words = line.split("\t");
            for (String word : words) {
                context.write(new Text(word), new LongWritable(1));
            }
        }
    }

    /**
     * @ClassName: MyReduce
     * @Description: The reduce function; processes the output of the map function
     * KEYIN: a word emitted by map
     * VALUEIN: the set of counts emitted by map for that word
     * KEYOUT: each distinct word in the text
     * VALUEOUT: the number of occurrences of that word
     */
    static class MyReduce extends Reducer<Text, LongWritable, Text, LongWritable> {

        /**
         * key: a key after map-side grouping, i.e. one distinct word
         * values: the set of values grouped under that key
         * context: the context object
         */
        @Override
        protected void reduce(Text key, java.lang.Iterable<LongWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            long sum = 0;
            StringBuilder sb = new StringBuilder();
            // Process the input values
            for (LongWritable value : values) {
                // value.get() converts the LongWritable to a long
                sum += value.get();
                sb.append(value.toString()).append(",");
            }
            System.out.println("key:" + key.toString() + "========values:" + sb.toString());
            context.write(key, new LongWritable(sum));
        }
    }
}

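Suppose the input file contains the following (the mapper splits on "\t", so the two words on each line must be separated by a tab):

hello world

hello hadoop

The program then outputs the following, sorted by key:

hadoop 1

hello 2

world 1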

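Permission problems: if the job fails with a permission error at run time, there are two ways to work around it.

First: package the code as a jar and run it on the virtual machine.

Use Eclipse's Export function to package the code as a jar.

Upload it to the virtual machine and run:

hadoop jar name.jar param1 param2

Second: modify the permission check in the source file.

Method signature before the change:

Method signature after the change: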

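★ Data Types and Formats

Hadoop data types correspond to Java data types.

For example:

Hadoop type	Java type
Text	String
IntWritable	int/Integer
LongWritable	long/Long
BooleanWritable	boolean/Boolean

Converting between the two kinds of types:

Java type to Hadoop type: use the constructor, or call the set method of the Hadoop type.

Hadoop type to Java type: call the get method of the Hadoop type, or toString for Text.

Note:

Hadoop's data types all implement the Writable interface.

▲ Custom data types:

When the built-in types cannot meet a business need, define a custom data type.

Step 1: implement the Writable interface.

Step 2: implement the two methods of the Writable interface:

public void readFields(DataInput in) throws IOException {}

public void write(DataOutput out) throws IOException {}

For example: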

/**
 * @Title: PhoneWritable.java
 * @Package com.broader.mr
 */
package com.broader.mr;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

/**
 * @ClassName: PhoneWritable
 * @Description: Custom data type for the special record format of mobile phone traffic data
 * @author wealon wealondatou@gmail.com
 * @date 2014-7-17 6:29:18 PM
 */
public class PhoneWritable implements Writable {

    long msisdn;
    long upPackNum;
    long downPackNum;
    long upPayLoad;
    long downPayLoad;

    // Hadoop needs the no-argument constructor to create instances before calling readFields
    public PhoneWritable() {}

    public PhoneWritable(String msisdn, String upPackNum, String downPackNum,
            String upPayLoad, String downPayLoad) {
        this.msisdn = Long.parseLong(msisdn);
        this.upPackNum = Long.parseLong(upPackNum);
        this.downPackNum = Long.parseLong(downPackNum);
        this.upPayLoad = Long.parseLong(upPayLoad);
        this.downPayLoad = Long.parseLong(downPayLoad);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(msisdn);
        out.writeLong(upPackNum);
        out.writeLong(downPackNum);
        out.writeLong(upPayLoad);
        out.writeLong(downPayLoad);
    }

    // Fields must be read in exactly the order in which write() wrote them
    @Override
    public void readFields(DataInput in) throws IOException {
        this.msisdn = in.readLong();
        this.upPackNum = in.readLong();
        this.downPackNum = in.readLong();
        this.upPayLoad = in.readLong();
        this.downPayLoad = in.readLong();
    }

    @Override
    public String toString() {
        return "upPackNum=" + this.upPackNum +
                " downPackNum=" + this.downPackNum +
                " upPayLoad=" + this.upPayLoad +
                " downPayLoad=" + this.downPayLoad;
    }
}
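
To see what write() and readFields() do to an object, the round trip can be tried on a plain JDK. The sketch below is an assumption-laden stand-in, not the class above: PhoneRecord is a hypothetical, shortened version with three fields that does not implement org.apache.hadoop.io.Writable, and the serialize/deserialize helpers are invented here only to drive the two methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for PhoneWritable: same write/readFields pattern,
// but without the Hadoop Writable interface so it runs on a plain JDK.
class PhoneRecord {
    long msisdn;
    long upPayLoad;
    long downPayLoad;

    PhoneRecord() {}

    PhoneRecord(long msisdn, long upPayLoad, long downPayLoad) {
        this.msisdn = msisdn;
        this.upPayLoad = upPayLoad;
        this.downPayLoad = downPayLoad;
    }

    void write(DataOutput out) throws IOException {
        out.writeLong(msisdn);
        out.writeLong(upPayLoad);
        out.writeLong(downPayLoad);
    }

    // Reads the fields back in exactly the order write() produced them.
    void readFields(DataInput in) throws IOException {
        msisdn = in.readLong();
        upPayLoad = in.readLong();
        downPayLoad = in.readLong();
    }

    static byte[] serialize(PhoneRecord r) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            r.write(new DataOutputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static PhoneRecord deserialize(byte[] data) {
        PhoneRecord r = new PhoneRecord();
        try {
            r.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return r;
    }

    public static void main(String[] args) {
        byte[] data = serialize(new PhoneRecord(13800000000L, 120L, 3400L));
        PhoneRecord copy = deserialize(data);
        System.out.println(data.length + " bytes, msisdn=" + copy.msisdn);
    }
}
```

Three long fields serialize to 24 bytes, since writeLong always writes 8 bytes. Swapping two reads in readFields would silently mix up the fields, which is why the read order has to mirror the write order.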

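★ The Writable Interface and Serialization

All of Hadoop's basic data types implement the Writable interface and therefore the following methods:

public void readFields(DataInput in) throws IOException {}

public void write(DataOutput out) throws IOException {}

Hadoop serialization is implemented through this Writable interface;

Java serialization is implemented through the java.io.Serializable interface.

Serialization means converting a structured object into a byte stream;

deserialization is the reverse process.

Every key and value in MR must implement the Writable interface.

Every key in MR must additionally implement the WritableComparable interface.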
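
One practical difference between the two mechanisms is how much ends up in the byte stream. The following sketch compares them on a plain JDK under stated assumptions: the Flow class and both helper methods are hypothetical, and the "Writable-style" method only imitates what a write() implementation does by writing raw field values through a DataOutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializationSizes {

    // Hypothetical record with two long fields.
    static class Flow implements Serializable {
        private static final long serialVersionUID = 1L;
        long up = 120L;
        long down = 3400L;
    }

    // Writable-style: only the raw field values are written.
    static int writableStyleSize(Flow f) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(f.up);
            out.writeLong(f.down);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // java.io.Serializable: the stream also carries a header and class metadata.
    static int javaSerializationSize(Flow f) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(f);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        Flow f = new Flow();
        System.out.println("Writable-style bytes: " + writableStyleSize(f));
        System.out.println("Java serialization bytes: " + javaSerializationSize(f));
    }
}
```

The Writable-style stream is exactly 16 bytes for two longs, while the Java-serialized form is larger because it describes the class as well as the data.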

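★ Comparing the Old and New Hadoop APIs

"New" here means Hadoop 1.x.

"Old" means Hadoop 0.x.

1. Different packages

The old API classes are generally in the mapred package.

The new API classes are generally in the mapreduce package.

2. The old API uses a JobConf object to describe a job.

The new API uses a Job object to describe a job.

3. Job submission

The old API uses JobClient.runJob.

The new API uses job.waitForCompletion().

4. Different interfaces

With the old API, a mapper extends the MapReduceBase class and implements the Mapper interface.

With the new API, a mapper extends the Mapper class directly.

5. Different output file names

Old API output files are named like part-00000, with no r in the middle; new API output files are named like part-r-00000.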

★ Hadoop Counters

There are custom counters and the counters built into the framework. The following analyzes one example.

The MapReduce input consists of two files with this content:

file1:

hello world

hello itcast

file2:

welcome star

welcome dragon

------------------------------------------------------------------------------------

Counters: 20
File Input Format Counters
Bytes Read=53 // size of the input files
File Output Format Counters
Bytes Written=51 // size of the output file
FileSystemCounters // file system counters, covering both HDFS and the local Linux file system
FILE_BYTES_READ=1310
HDFS_BYTES_READ=134
FILE_BYTES_WRITTEN=190436
HDFS_BYTES_WRITTEN=51
Map-Reduce Framework // counters of the MapReduce framework
Map output materialized bytes=145
Map input records=4 // records read: 4 lines
Reduce shuffle bytes=0
Spilled Records=16 // records spilled to disk
Map output bytes=117 // bytes output by map
Total committed heap usage (bytes)=482291712
Map input bytes=53 // bytes of input
SPLIT_RAW_BYTES=184
Combine input records=0 // records input to the combiner
Reduce input records=8 // records input to reduce
Reduce input groups=6 // key groups input to reduce
Combine output records=0 // records output by the combiner
Reduce output records=6 // records output by reduce
Map output records=8 // records output by map: 8
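
Several of these numbers can be cross-checked by hand against the two input files. The sketch below recomputes four of them in plain Java; it is not how Hadoop maintains its counters, the class name CounterCheck is made up, and it assumes ASCII text with one newline byte at the end of every line.

```java
import java.util.List;
import java.util.TreeSet;

// Recomputes a few framework counters for the two sample files.
class CounterCheck {
    static final List<String> FILE1 = List.of("hello world", "hello itcast");
    static final List<String> FILE2 = List.of("welcome star", "welcome dragon");

    // Returns {map input records, map output records, reduce input groups, bytes read}.
    static long[] count(List<List<String>> files) {
        long lines = 0;
        long words = 0;
        long bytes = 0;
        TreeSet<String> distinct = new TreeSet<>();
        for (List<String> file : files) {
            for (String line : file) {
                lines++;                      // one map input record per line
                bytes += line.length() + 1;   // +1 for the assumed trailing newline
                for (String w : line.split("\\s+")) {
                    words++;                  // one map output record per word
                    distinct.add(w);          // one reduce group per distinct word
                }
            }
        }
        return new long[] {lines, words, distinct.size(), bytes};
    }

    public static void main(String[] args) {
        long[] c = count(List.of(FILE1, FILE2));
        System.out.println("Map input records=" + c[0]);
        System.out.println("Map output records=" + c[1]);
        System.out.println("Reduce input groups=" + c[2]);
        System.out.println("Bytes Read=" + c[3]);
    }
}
```

The results, 4, 8, 6 and 53, match the log: four lines, eight words, six distinct words, and 53 input bytes.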

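The following code counts the lines that contain a Tab: whenever a line contains a tab character, the custom counter is incremented by 1.

Code snippet:

/**
 * String groupName: the group name of the counter
 * String counterName: the name of the counter
 */
Counter count = context.getCounter("idCounter", "tab-sum");
if (line.contains("\t")) {
    // Increment by 1
    count.increment(1L);
}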

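★ Packaging a Hadoop Job as a Jar

Besides running directly inside Eclipse, a program can be packaged as a jar and run from the command line on Linux.

The steps below show how to build the jar and how to run it on Linux.

1. Prepare the program:

For the program to run on Linux as a jar, the Job setup must include the following code:

2. Build the jar

3. Enter the name of the jar to create

4. Upload the jar to Linux (omitted)

5. Run the jar with the following command

# hadoop jar paramr.jar hdfs://hadooptest:9000/input/ hdfs://hadooptest:9000/out3

If the program is correct, the progress of the MapReduce job is displayed, followed by the counter log output.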

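★ Combiner

A Combiner merges the map output once before it leaves the map side. Map tasks and Reducers usually run on different nodes, so the map output has to travel over the network to the Reducers; merging on the map side reduces the amount of data transferred.

At its simplest, a Combiner merges values that share the same key locally.

In effect the Combiner aggregates data locally, which makes the job run more efficiently.

The output of the Combiner is the input of the Reducer, so a Combiner must never change the final result of the computation.

A Combiner is effectively a map-side Reducer, so it extends the Reducer class.

For example:

static class MyCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, java.lang.Iterable<LongWritable> values, Context context)
            throws java.io.IOException, InterruptedException {
        // omitted...
    }
}

The method body is written much like that of a Reducer.

To enable the Combiner, register it in the Job configuration.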
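
The effect of local aggregation is easy to measure in a small simulation. The sketch below is plain Java rather than Hadoop code, and every name in it (CombinerSketch, mapTask, shuffle, sumByKey) is hypothetical. It treats each inner list as the input split of one map task and counts how many records would cross the network with and without a combiner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {

    // Sums the counts per key; serves as both the combiner and the reducer here.
    static TreeMap<String, Long> sumByKey(List<Map.Entry<String, Long>> records) {
        TreeMap<String, Long> sums = new TreeMap<>();
        for (Map.Entry<String, Long> r : records) {
            sums.merge(r.getKey(), r.getValue(), Long::sum);
        }
        return sums;
    }

    // Output of one map task: (word, 1) for every word in its lines.
    static List<Map.Entry<String, Long>> mapTask(List<String> lines) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                out.add(Map.entry(word, 1L));
            }
        }
        return out;
    }

    // Records sent from all map tasks to the reducer.
    static List<Map.Entry<String, Long>> shuffle(List<List<String>> splits, boolean useCombiner) {
        List<Map.Entry<String, Long>> shuffled = new ArrayList<>();
        for (List<String> split : splits) {
            List<Map.Entry<String, Long>> mapOutput = mapTask(split);
            if (useCombiner) {
                // Local aggregation: at most one record per key and map task.
                shuffled.addAll(sumByKey(mapOutput).entrySet());
            } else {
                shuffled.addAll(mapOutput);
            }
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(
                List.of("hello world", "hello hadoop"),
                List.of("hello hello"));
        System.out.println("without combiner: " + shuffle(splits, false).size() + " records");
        System.out.println("with combiner:    " + shuffle(splits, true).size() + " records");
        System.out.println("result: " + sumByKey(shuffle(splits, true)));
    }
}
```

For this input the shuffle shrinks from 6 records to 4 while the reduced result stays the same. That only holds because summing is associative and commutative; an operation such as averaging could not be reused as its own combiner this way.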

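★ Partitioning

Custom partitioning example: when summarizing traffic data, records keyed by a mobile phone number go to one output file and all other records go to another.

Partitioner is the base class of HashPartitioner.

HashPartitioner is Hadoop's default partitioner.

A Partitioner computes the partition number for each record in its getPartition() method.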
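
The partition computation itself is small enough to show in isolation. In the sketch below, hashPartition follows the formula HashPartitioner uses (hash code masked to a non-negative value, modulo the number of reduce tasks). phonePartition is a purely hypothetical rule for the phone-number example: it assumes a mobile number is exactly 11 digits, which the article does not state.

```java
class PartitionSketch {

    // Hash partitioning: mask off the sign bit so the result is never negative.
    static int hashPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical custom rule: 11-digit keys go to partition 0, everything else to 1.
    static int phonePartition(String key) {
        return key.matches("\\d{11}") ? 0 : 1;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"hello", "world", "hadoop"}) {
            System.out.println(key + " -> partition " + hashPartition(key, 3));
        }
        System.out.println("13800000000 -> partition " + phonePartition("13800000000"));
        System.out.println("84138413 -> partition " + phonePartition("84138413"));
    }
}
```

A two-way rule like phonePartition only makes sense when the job runs with two reduce tasks, because each partition number must correspond to an existing reduce task and each reduce task writes its own output file.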

★ Sorting

Raw data:

3 3

3 2

3 1

2 1

2 2

1 1

Required result after sorting:

First column ascending, second column descending or ascending.

See the example: com.broader.mr.SortMR
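
The ordering rule can be written down as a comparator and tried on the raw data. This is a plain-Java sketch, not the SortMR class referenced above, which is not shown in this article; in a real job the same comparison would presumably live in the compareTo method of a custom key type. The names SortSketch, sort and format are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortSketch {
    static final List<long[]> DATA = List.of(
            new long[] {3, 3}, new long[] {3, 2}, new long[] {3, 1},
            new long[] {2, 1}, new long[] {2, 2}, new long[] {1, 1});

    // First column ascending; second column descending or ascending on request.
    static List<long[]> sort(List<long[]> rows, boolean secondDescending) {
        Comparator<long[]> byFirst = Comparator.comparingLong(r -> r[0]);
        Comparator<long[]> bySecond = Comparator.comparingLong(r -> r[1]);
        if (secondDescending) {
            bySecond = bySecond.reversed();
        }
        List<long[]> sorted = new ArrayList<>(rows);
        sorted.sort(byFirst.thenComparing(bySecond));
        return sorted;
    }

    static String format(List<long[]> rows) {
        StringBuilder sb = new StringBuilder();
        for (long[] r : rows) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(r[0]).append(' ').append(r[1]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format(sort(DATA, true)));   // second column descending
        System.out.println(format(sort(DATA, false)));  // second column ascending
    }
}
```

The second column is only consulted when two rows tie on the first column, which is the essence of a two-column sort key.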

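★ Grouping

Raw data:

3 3

3 2

3 1

2 1

2 2

1 1

--------------------------------------------------------------------

Expected output:

3 1

2 1

1 1

--------------------------------------------------------------------

Approach: group on the first column and, within each group, output the smallest of the values.

Implementation:

// Set the grouping comparator class
job.setGroupingComparatorClass(MyGroup.class);

The grouping class:

package com.broader.mr;

import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.WritableComparator;

public class MyGroup implements RawComparator<MyKey> {

    @Override
    public int compare(MyKey o1, MyKey o2) {
        System.out.println("MyGroup compare...");
        // Long.compare avoids the overflow that casting (o1.first - o2.first) to int can cause
        return Long.compare(o1.first, o2.first);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Compare only the first 8 bytes of each serialized key. Assuming MyKey writes
        // its long field `first` before anything else, those 8 bytes are `first`, so
        // keys with the same first column end up in the same group.
        return WritableComparator.compareBytes(b1, s1, 8, b2, s2, 8);
    }
}

See: com.broader.mr.GroupMR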
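
The approach, group on the first column and keep the smallest value per group, can be checked without a cluster. The following is a plain-Java sketch with hypothetical names (GroupSketch, minPerGroup), not the GroupMR job itself, and it lists the groups in ascending key order.

```java
import java.util.List;
import java.util.TreeMap;

class GroupSketch {
    static final List<long[]> DATA = List.of(
            new long[] {3, 3}, new long[] {3, 2}, new long[] {3, 1},
            new long[] {2, 1}, new long[] {2, 2}, new long[] {1, 1});

    // Groups rows by the first column and keeps the minimum of the second column.
    static TreeMap<Long, Long> minPerGroup(List<long[]> rows) {
        TreeMap<Long, Long> result = new TreeMap<>();
        for (long[] r : rows) {
            result.merge(r[0], r[1], Long::min);
        }
        return result;
    }

    public static void main(String[] args) {
        minPerGroup(DATA).forEach((first, min) -> System.out.println(first + " " + min));
    }
}
```

Each group reduces to the value 1, which agrees with the expected output listed above apart from the order of the lines.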

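★ Common Algorithms

TOPKEY

★ Case Studies:

This section walks through the official examples shipped in the Hadoop distribution.

▲ Example 1: WordCount

Location:

\src\examples\org\apache\hadoop\examples\WordCount.java

Map function

Reducer class

Driver class

Note: the driver sets the CombinerClass to IntSumReducer. A Combiner belongs to the Map phase and is often called a map-side reduce. Running a Combiner can substantially reduce the data sent over the network from the Map side to the Reducers, which makes the job run more efficiently.

▲ Example 2: SecondarySort

This is an example Hadoop Map/Reduce application. It reads the text input files that must contain two integers per line. The output is sorted by the first and second number and grouped on the first number.

The input file content is as follows:

The output is as follows: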

● 1. Define IntPair, used as the new key for comparison

public static class IntPair implements WritableComparable<IntPair> {
    private int first = 0;
    private int second = 0;

    /**
     * Set the left and right values.
     */
    public void set(int left, int right) {
        first = left;
        second = right;
    }

    public int getFirst() {
        return first;
    }

    public int getSecond() {
        return second;
    }

    /**
     * Read the two integers. Encoded as: MIN_VALUE -> 0, 0 -> -MIN_VALUE,
     * MAX_VALUE -> -1
     */
    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt() + Integer.MIN_VALUE;
        second = in.readInt() + Integer.MIN_VALUE;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first - Integer.MIN_VALUE);
        out.writeInt(second - Integer.MIN_VALUE);
    }

    @Override
    public int hashCode() {
        return first * 157 + second;
    }

    @Override
    public boolean equals(Object right) {
        if (right instanceof IntPair) {
            IntPair r = (IntPair) right;
            return r.first == first && r.second == second;
        } else {
            return false;
        }
    }

    /** A Comparator that compares serialized IntPair. */
    public static class Comparator extends WritableComparator {
        public Comparator() {
            super(IntPair.class);
        }

        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return compareBytes(b1, s1, l1, b2, s2, l2);
        }
    }

    static { // register this comparator
        WritableComparator.define(IntPair.class, new Comparator());
    }

    // A neat design: compare on first, and fall back to second only when first is equal
    @Override
    public int compareTo(IntPair o) {
        if (first != o.first) {
            return first < o.first ? -1 : 1;
        } else if (second != o.second) {
            return second < o.second ? -1 : 1;
        } else {
            return 0;
        }
    }
}
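
The reason write() subtracts Integer.MIN_VALUE is that the raw comparator compares serialized bytes as unsigned values, and plain two's-complement bytes would sort negative numbers after positive ones. The sketch below demonstrates this on a plain JDK; OffsetEncoding and its methods are hypothetical names, and ByteBuffer is used because its default big-endian layout matches what DataOutput.writeInt produces.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class OffsetEncoding {

    // Same idea as IntPair.write(): shift by MIN_VALUE, then 4 big-endian bytes.
    static byte[] encode(int v) {
        return ByteBuffer.allocate(4).putInt(v - Integer.MIN_VALUE).array();
    }

    // Plain two's-complement bytes, for contrast.
    static byte[] plain(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    // Unsigned lexicographic comparison, as a raw byte comparator would do it.
    static int compareRaw(byte[] a, byte[] b) {
        return Integer.signum(Arrays.compareUnsigned(a, b));
    }

    public static void main(String[] args) {
        // With the offset, byte order agrees with numeric order: -5 sorts before 3.
        System.out.println(compareRaw(encode(-5), encode(3)));
        // Without it, -5 (0xFFFFFFFB) sorts after 3 (0x00000003).
        System.out.println(compareRaw(plain(-5), plain(3)));
    }
}
```

The first comparison is negative and the second positive, so only the offset encoding lets compareBytes order IntPair keys correctly without deserializing them.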

● 2. Define FirstPartitioner

/**
 * Partition based on the first part of the pair.
 */
public static class FirstPartitioner extends Partitioner<IntPair, IntWritable> {
    @Override
    public int getPartition(IntPair key, IntWritable value, int numPartitions) {
        return Math.abs(key.getFirst() * 127) % numPartitions;
    }
}

● 3. Define FirstGroupingComparator

/**
 * Compare only the first part of the pair, so that reduce is called once
 * for each value of the first part.
 */
public static class FirstGroupingComparator implements RawComparator<IntPair> {
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return WritableComparator.compareBytes(b1, s1, Integer.SIZE / 8,
                b2, s2, Integer.SIZE / 8);
    }

    @Override
    public int compare(IntPair o1, IntPair o2) {
        int l = o1.getFirst();
        int r = o2.getFirst();
        return l == r ? 0 : (l < r ? -1 : 1);
    }
}

● 4. Define MapClass

/**
 * Read two integers from each line and generate a key, value pair as
 * ((left, right), right).
 */
public static class MapClass extends Mapper<LongWritable, Text, IntPair, IntWritable> {
    private final IntPair key = new IntPair();
    private final IntWritable value = new IntWritable();

    @Override
    public void map(LongWritable inKey, Text inValue, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(inValue.toString());
        int left = 0;
        int right = 0;
        if (itr.hasMoreTokens()) {
            left = Integer.parseInt(itr.nextToken());
            if (itr.hasMoreTokens()) {
                right = Integer.parseInt(itr.nextToken());
            }
            key.set(left, right);
            value.set(right);
            context.write(key, value);
        }
    }
}

● 5. Define the Reduce class

/**
 * A reducer class that emits a separator line for each group, followed by
 * every value of the group keyed by the group's first number.
 */
public static class Reduce extends Reducer<IntPair, IntWritable, Text, IntWritable> {
    private static final Text SEPARATOR = new Text(
            "------------------------------------------------");
    private final Text first = new Text();

    @Override
    public void reduce(IntPair key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        context.write(SEPARATOR, null);
        first.set(Integer.toString(key.getFirst()));
        for (IntWritable value : values) {
            context.write(first, value);
        }
    }
}

● 6. Define the driver class

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // String[] otherArgs = new GenericOptionsParser(conf,
    //         args).getRemainingArgs();
    // if (otherArgs.length != 2) {
    //     System.err.println("Usage: secondarysort <in> <out>");
    //     System.exit(2);
    // }
    Job job = new Job(conf, "secondary sort");
    job.setJarByClass(SecondarySort.class);
    job.setMapperClass(MapClass.class);
    job.setReducerClass(Reduce.class);

    // group and partition by the first int in the pair
    job.setPartitionerClass(FirstPartitioner.class);
    job.setGroupingComparatorClass(FirstGroupingComparator.class);

    // the map output is IntPair, IntWritable
    job.setMapOutputKeyClass(IntPair.class);
    job.setMapOutputValueClass(IntWritable.class);

    // the reduce output is Text, IntWritable
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(
            "hdfs://hadooptest:9000/secondsort"));
    FileOutputFormat.setOutputPath(job, new Path(
            "hdfs://hadooptest:9000/out2"));
    // FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    // FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

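● Notes

1: Example 2 uses IntPair, which implements WritableComparable. In Hadoop, every type used as a key implements this interface; see the implementation of LongWritable for details.

2: Defining the partition

A custom partitioner extends the Partitioner class and implements the getPartition() method. In this example the partition is computed from the first field of IntPair.

3: FirstGroupingComparator implements RawComparator; its byte-array compare method compares the serialized bytes directly.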
