大数据知识(六)：Hadoop基础MapReduce详解(二)-CSDN博客

本文链接：https://blog.csdn.net/Oeljeklaus/article/details/80460995

Java序列化与Hadoop序列化之间的比较

Java序列化是一个很重的框架 (Serializable),一个对象被序列化以后会附带很多的额外信息，不便于网络中的高速传输。Hadoop开发了一套自己的序列化框架(Writable)：精简、高效。

以下代码比较两者之间的差别:

public class TestSeri {
	public static void main(String[] args) throws Exception {
		//定义两个ByteArrayOutputStream，用来接收不同序列化机制的序列化结果
		ByteArrayOutputStream ba = new ByteArrayOutputStream();
		ByteArrayOutputStream ba2 = new ByteArrayOutputStream();

		//定义两个DataOutputStream，用于将普通对象进行jdk标准序列化
		DataOutputStream dout = new DataOutputStream(ba);
		DataOutputStream dout2 = new DataOutputStream(ba2);
		ObjectOutputStream obout = new ObjectOutputStream(dout2);
		//定义两个bean，作为序列化的源对象
		ItemBeanSer itemBeanSer = new ItemBeanSer(1000L, 89.9f);
		ItemBean itemBean = new ItemBean(1000L, 89.9f);

		//用于比较String类型和Text类型的序列化差别
		Text atext = new Text("a");
		// atext.write(dout);
		itemBean.write(dout);

		byte[] byteArray = ba.toByteArray();

		//比较序列化结果
		System.out.println(byteArray.length);
		for (byte b : byteArray) {

			System.out.print(b);
			System.out.print(":");
		}

		System.out.println("-----------------------");

		String astr = "a";
		// dout2.writeUTF(astr);
		obout.writeObject(itemBeanSer);

		byte[] byteArray2 = ba2.toByteArray();
		System.out.println(byteArray2.length);
		for (byte b : byteArray2) {
			System.out.print(b);
			System.out.print(":");
		}
	}
}

自定义对象实现序列化

如果需要将自定义的bean放在key中传输，则还需要实现comparable接口，因为mapreduce框中的shuffle过程一定会对key进行排序,此时，自定义的bean实现的接口应该是：public class FlowBean implements WritableComparable<FlowBean>

	/**
	 * 反序列化的方法，反序列化时，从流中读取到的各个字段的顺序应该与序列化时写出去的顺序保持一致
	 */
	@Override
	public void readFields(DataInput in) throws IOException {
		
		upflow = in.readLong();
		dflow = in.readLong();
		sumflow = in.readLong();
		

	}

	/**
	 * 序列化的方法
	 */
	@Override
	public void write(DataOutput out) throws IOException {

		out.writeLong(upflow);
		out.writeLong(dflow);
		//可以考虑不序列化总流量，因为总流量是可以通过上行流量和下行流量计算出来的
		out.writeLong(sumflow);

	}
	
	@Override
	public int compareTo(FlowBean o) {
		
		//实现按照sumflow的大小倒序排序
		return sumflow>o.getSumflow()?-1:1;
	}

MapReduce和Yarn

Yarn概述

Yarn是一个资源调度平台，负责为运算程序提供服务器运算资源，相当于一个分布式的操作系统平台，而mapreduce等运算程序则相当于运行于操作系统之上的应用程序。

Yarn的重要概念

1、 yarn并不清楚用户提交的程序的运行机制

2、 yarn只提供运算资源的调度（用户程序向yarn申请资源，yarn就负责分配资源）

3、 yarn中的主管角色叫ResourceManager

4、 yarn中具体提供运算资源的角色叫NodeManager

5、这样一来，yarn其实就与运行的用户程序完全解耦，就意味着yarn上可以运行各种类型的分布式运算程序（mapreduce只是其中的一种），比如mapreduce、storm程序，spark程序，tez ……

6、所以，spark、storm等运算框架都可以整合在yarn上运行，只要他们各自的框架中有符合yarn规范的资源请求机制即可

7、 Yarn就成为一个通用的资源调度平台，从此，企业中以前存在的各种运算集群都可以整合在一个物理集群上，提高资源利用率，方便数据共享

Yarn中运行运算程序的框架

MapReduce程序的调度如下：

具体流程如下：

1.客户端提交MapReduce程序后，ResourceManager根据资源情况返回给客户端一个jobID，以及作业提交资源的stagingdir拼接称为一个新的路径。

2.客户端根据ResourceManager返回的路径将job.jar、job.xml和job.splits文件提交到这个路径，然后通知ResourceManager，资源文件提交完毕。

3.ResouceManager收到客户端资源提交完毕的通知后，初始化任务task，放入到任务调度队列中，NodeManager会定期的到任务调度队列中领取任务。

4.NodeManager领取到相关任务后，分配任务资源容器Container，在资源容器分配完成之后，客户端通知NodeManager启动一个程序MRAppMaster。

5.NodeManager收到任务之后，向ResouceManager请求运行MapTask程序的资源

6.在分配完MapTask程序之后，NodeManager领取到任务，MapTask运行完成后输出结果

7.ReduceTask程序启动，根据MapTask计算的结果，根据分区号在YarnChild上拉去计算结果再次计算，直到任务完成。

MapReduce实践篇

MapReduce中的分区partitioner

需求

根据归属地输出流量统计数据结果到不同文件，以便于在查询统计结果时可以定位到省级范围进行

分析

Mapreduce中会将map输出的kv对，按照相同key分组，然后分发给不同的reducetask

默认的分发规则为：根据key的hashcode%reducetask数来分发

所以：如果要按照我们自己的需求进行分组，则需要改写数据分发（分组）组件Partitioner

自定义一个CustomPartitioner继承抽象类：Partitioner

然后在job对象中，设置自定义partitioner：job.setPartitionerClass(CustomPartitioner.class)

实现

/**
 * 定义自己的从map到reduce之间的数据（分组）分发规则 按照手机号所属的省份来分发（分组）ProvincePartitioner
 * 默认的分组组件是HashPartitioner
 * 
 * @author
 * 
 */
public class ProvincePartitioner extends Partitioner<Text, FlowBean> {

	static HashMap<String, Integer> provinceMap = new HashMap<String, Integer>();

	static {

		provinceMap.put("135", 0);
		provinceMap.put("136", 1);
		provinceMap.put("137", 2);
		provinceMap.put("138", 3);
		provinceMap.put("139", 4);

	}

	@Override
	public int getPartition(Text key, FlowBean value, int numPartitions) {

		Integer code = provinceMap.get(key.toString().substring(0, 3));

		return code == null ? 5 : code;
	}

}

MapReduce数据压缩

概念

这是mapreduce的一种优化策略：通过压缩编码对mapper或者reducer的输出进行压缩，以减少磁盘IO，提高MR程序运行速度（但相应增加了cpu运算负担）

1、 Mapreduce支持将map输出的结果或者reduce输出的结果进行压缩，以减少网络IO或最终输出数据的体积

2、压缩特性运用得当能提高性能，但运用不当也可能降低性能

3、基本原则：

运算密集型的job，少用压缩

IO密集型的job，多用压缩

MR支持的压缩编码

Reducer压缩输出

在配置参数或在代码中都可以设置reduce的输出压缩

1、在配置参数中设置

mapreduce.output.fileoutputformat.compress=false

mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec

mapreduce.output.fileoutputformat.compress.type=RECORD

2、在代码中设置

Job job = Job.getInstance(conf);

FileOutputFormat.setCompressOutput(job, true);

FileOutputFormat.setOutputCompressorClass(job, (Class<? extends CompressionCodec>) Class.forName(""));

Mapper压缩输出

在配置参数或在代码中都可以设置reduce的输出压缩

1、在配置参数中设置

mapreduce.map.output.compress=false

mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec

2、在代码中设置：

conf.setBoolean(Job.MAP_OUTPUT_COMPRESS, true);

conf.setClass(Job.MAP_OUTPUT_COMPRESS_CODEC, GzipCodec.class, CompressionCodec.class);

压缩文件的读取

Hadoop自带的InputFormat类内置支持压缩文件的读取，比如TextInputformat类，在其initialize方法中：

  public void initialize(InputSplit genericSplit,
                         TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();

    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(file);
    //根据文件后缀名创建相应压缩编码的codec
    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    if (null!=codec) {
      isCompressedInput = true;	
      decompressor = CodecPool.getDecompressor(codec);
	  //判断是否属于可切片压缩编码类型
      if (codec instanceof SplittableCompressionCodec) {
        final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec)codec).createInputStream(
            fileIn, decompressor, start, end,
            SplittableCompressionCodec.READ_MODE.BYBLOCK);
		 //如果是可切片压缩编码，则创建一个CompressedSplitLineReader读取压缩数据
        in = new CompressedSplitLineReader(cIn, job,
            this.recordDelimiterBytes);
        start = cIn.getAdjustedStart();
        end = cIn.getAdjustedEnd();
        filePosition = cIn;
      } else {
		//如果是不可切片压缩编码，则创建一个SplitLineReader读取压缩数据，并将文件输入流转换成解压数据流传递给普通SplitLineReader读取
        in = new SplitLineReader(codec.createInputStream(fileIn,
            decompressor), job, this.recordDelimiterBytes);
        filePosition = fileIn;
      }
    } else {
      fileIn.seek(start);
	   //如果不是压缩文件，则创建普通SplitLineReader读取数据
      in = new SplitLineReader(fileIn, job, this.recordDelimiterBytes);
      filePosition = fileIn;
    }