MapReduce

Latest recommended article published on 2023-06-12 08:08:44

跟浩哥学大数据

Category: MapReduce  Tags: big data, mapreduce

Article link: https://blog.csdn.net/qq_34911436/article/details/121569758

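MapReduce Key Points

I. The Core Idea of MapReduce

1. A MapReduce distributed computation is generally divided into two stages: the Map stage and the Reduce stage.

2. In the first stage (Map), all MapTask instances run fully in parallel and independently of one another.

3. In the second stage (Reduce), all ReduceTask instances also run fully in parallel and independently of one another, but each ReduceTask depends entirely on the previous stage, that is, on the output of all the concurrent MapTask instances.

II. MapReduce Programming Conventions

1. Mapper stage
 The user-defined map class must extend Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
 The map() method (running in the MapTask process) is called once for every input <K,V> pair.
2. Reducer stage
 The user-defined reduce class must extend Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
 The reduce() method (running in the ReduceTask process) is called once for every group of <K,V> pairs that share the same key.
3. Driver stage
 The driver acts as the client of the YARN cluster and submits the whole program to YARN.
 What gets submitted is a Job object that encapsulates the runtime parameters of the MapReduce program.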
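
To make the calling convention above concrete, here is a minimal sketch of a word count written with the JDK only. It does not use any Hadoop class: `MiniWordCount`, `map`, `reduce` and `run` are hypothetical names, and `run` merely stands in for what the framework does between the two stages (grouping values by key, in key order).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, JDK-only stand-in for a MapReduce word count (no Hadoop classes involved).
final class MiniWordCount {

    // "Map" step: called once per input line, emits one (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // "Reduce" step: called once per key with all the values collected for that key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for the framework: map every line, group by key in sorted order, reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

In a real job the Mapper and Reducer subclasses would hold the bodies of `map` and `reduce`, and the Driver would replace `run`.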

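III. Serialization

1. Serialization converts objects in memory into a byte sequence so that they can be stored on disk or transmitted over the network.

2. Deserialization converts a serialized byte sequence, or data persisted on disk, back into objects in memory.

3. Why serialize: data in memory is generally lost when power is cut, and in-memory objects can only be used by the local process; they cannot be sent to another computer on the network. Serializing the in-memory data makes it possible to transmit it over the network and store it on disk.

4. Why Java serialization is not used: Java serialization (Serializable) is a heavyweight framework. A serialized object carries a lot of extra information (such as various validation information), which makes it inefficient to transmit over the network.

5. Characteristics of Hadoop serialization:

(1) Compact: uses storage space efficiently (few validation bits are needed), and the compact format makes good use of network bandwidth.

(2) Fast: low overhead for reading and writing data (little extra information is attached).

(3) Extensible: can be upgraded as the communication protocol evolves.

(4) Interoperable: supports interaction across multiple languages.

6. Nine commonly used serializable data types

7. How to serialize a custom bean

Option 1: if the custom bean is transmitted as a value, implement the Writable interface.

1. The custom bean must implement the Writable interface.

2. Deserialization invokes the no-argument constructor through reflection, so the bean must have a no-argument constructor.

3. Override the serialization method write(DataOutput out).

4. Override the deserialization method readFields(DataInput in).

5. The fields must be deserialized in exactly the same order in which they were serialized.

6. To show the result in an output file, override toString(); '\t' can be used as the field separator.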

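import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class FlowBean implements Writable {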
    private String phoneNum;
    private long upFlow;
    private long downFlow;
    private long sumFlow;
    
    // No-argument constructor, required so the framework can create the bean by reflection
    public FlowBean() {
    }

    public String getPhoneNum() {
        return phoneNum;
    }

    public void setPhoneNum(String phoneNum) {
        this.phoneNum = phoneNum;
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Read the fields in the same order in which write() wrote them
        phoneNum = in.readUTF();
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Write the fields; readFields() must read them back in this order
        out.writeUTF(phoneNum);
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    @Override
    public String toString() {
        return "" + upFlow + "\t" + downFlow + "\t" + sumFlow;
    }
    
}
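
The round trip that write() and readFields() perform can be tried without a cluster. The sketch below is a JDK-only illustration: `FlowRecord`, `toBytes` and `fromBytes` are hypothetical names, and the class only mirrors the two Writable methods instead of implementing Hadoop's interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical JDK-only class that mirrors the two Writable methods without depending on Hadoop.
final class FlowRecord {
    String phoneNum = "";
    long upFlow;
    long downFlow;

    // No-argument constructor, as a framework creating the object by reflection would need.
    FlowRecord() { }

    void write(DataOutput out) throws IOException {
        out.writeUTF(phoneNum);
        out.writeLong(upFlow);
        out.writeLong(downFlow);
    }

    // Reads the fields in exactly the order write() produced them.
    void readFields(DataInput in) throws IOException {
        phoneNum = in.readUTF();
        upFlow = in.readLong();
        downFlow = in.readLong();
    }

    // Serializes a record into a byte array.
    static byte[] toBytes(FlowRecord r) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            r.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Creates an empty record and fills it from the bytes.
    static FlowRecord fromBytes(byte[] bytes) {
        FlowRecord r = new FlowRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            r.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return r;
    }
}
```

An 11-character ASCII phone number plus two long fields serializes to 29 bytes here (2-byte length prefix + 11 + 8 + 8), which illustrates how little overhead this style of serialization adds.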

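Option 2: if the custom bean is transmitted as a key, implement the WritableComparable interface, because the shuffle phase of the MapReduce framework always sorts by key.

1. The bean must implement the WritableComparable interface.

2. Deserialization invokes the no-argument constructor through reflection, so the bean must have a no-argument constructor.

3. Override the serialization method write(DataOutput out).

4. Override the deserialization method readFields(DataInput in).

5. The fields must be deserialized in exactly the same order in which they were serialized.

6. To show the result in an output file, override toString(); "\t" can be used as the field separator.

7. To use the custom JavaBean as a key, override compareTo(), because the shuffle phase of the MapReduce framework requires keys to be sortable.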

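import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class UserBean implements WritableComparable<UserBean> {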
    
    private int id;
    private String name ;
    private String age;
    
    public UserBean() {
    }
    
    public UserBean(int id,String name , String age) {
        this.id = id;
        this.name = name;
        this.age = age;
    }
    
      public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getAge() {
        return age;
    }

    public void setAge(String age) {
        this.age = age;
    }
    
    @Override
    public String toString() {
        // Separate the fields with tabs so the output file is readable
        return this.id + "\t" + this.name + "\t" + this.age;
    }

    // Serialization: write the fields to the binary output stream
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(name);
        out.writeUTF(age);
    }

    // Deserialization: read the fields back in the same order in which they were written
    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        name = in.readUTF();
        age = in.readUTF();
    }

    // Keys are ordered by id, ascending
    @Override
    public int compareTo(UserBean o) {
        return Integer.compare(this.id, o.id);
    }

}

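IV. Partitioning

1. The MapTask class has a write() method that calls the collect() method of the MapOutputCollector object with three arguments: key, value and partition.

Call chain behind write(): custom Mapper class -> WrappedMapper -> TaskInputOutputContextImpl -> MapTask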

private final MapOutputCollector<K,V> collector;

public void write(K key, V value) throws IOException, InterruptedException {
collector.collect(key, value,
                 partitioner.getPartition(key, value, partitions));
}

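2. By default the partition is computed by HashPartitioner.getPartition(key, value, numReduceTasks): the key's hash code is ANDed with Integer.MAX_VALUE (which clears the sign bit), and the result is taken modulo the number of partitions.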

public class HashPartitioner<K, V> extends Partitioner<K, V> {

public int getPartition(K key, V value,
                       int numReduceTasks) {
 return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

}
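
The reason for the `& Integer.MAX_VALUE` mask is easy to check in isolation. The sketch below applies the same arithmetic to raw hash codes; `HashPartitionDemo` and `partitionFor` are hypothetical names used only for this illustration.

```java
// Hypothetical helper that repeats the arithmetic of HashPartitioner.getPartition on a raw hash code.
final class HashPartitionDemo {

    // Clearing the sign bit first guarantees a result in the range [0, numReduceTasks).
    static int partitionFor(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Convenience overload for arbitrary keys.
    static int partitionFor(Object key, int numReduceTasks) {
        return partitionFor(key.hashCode(), numReduceTasks);
    }
}
```

Without the mask, a negative hash code would give a negative remainder in Java (for example `-7 % 3` is `-1`), which is not a valid partition number. Math.abs would not be a safe substitute either, because Math.abs(Integer.MIN_VALUE) is still negative.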

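3. Custom partitioners and how to use them

3.1 A custom partitioner must extend the abstract class Partitioner<KEY, VALUE> and override its getPartition() method.

KEY and VALUE are the key and value types output by the Map stage.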

public class MyPartition extends Partitioner<FlowBean, Text> {
 @Override
 public int getPartition(FlowBean flowBean, Text text, int numPartitions) {
     String prenum= text.toString().substring(0,3);
     int partition=4;
     if("136".equals(prenum)){
         partition=0;
     }else if("137".equals(prenum)){
         partition=1;
     }else if ("138".equals(prenum)){
         partition=2;
     }else if("139".equals(prenum)){
         partition=3;
     }
     return partition;
 }
}

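3.2 Set the custom partitioner class and the number of reduce tasks in the Driver

// Set the custom partitioner class
job.setPartitionerClass(MyPartition.class);
// Set the number of ReduceTasks; MyPartition returns partitions 0 to 4, so 5 are needed
job.setNumReduceTasks(5);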

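3.3 Partitioning summary

1. If the number of ReduceTasks is 1, then no matter how many partitions the MapTask side produces, everything is handed to that single ReduceTask, so only one output file is produced.
2. If 1 < number of ReduceTasks < number of partitions returned by getPartition(), some partitions have no ReduceTask to receive them and an exception is thrown.
3. If the number of ReduceTasks is greater than the number of partitions returned by getPartition(), the extra ReduceTasks produce empty output files.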
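
The three cases in the summary can be sketched as a small routing function. This is a JDK-only illustration under stated assumptions: `PartitionCheck`, `prefixPartition` and `route` are hypothetical names, the prefix rules copy the MyPartition example, and the range check is only assumed to behave like the framework's own rejection of an out-of-range partition number.

```java
// Hypothetical illustration of how the number of ReduceTasks interacts with a custom partitioner.
final class PartitionCheck {

    // Same prefix rules as the MyPartition example: four known prefixes, everything else in partition 4.
    static int prefixPartition(String phone) {
        if (phone.startsWith("136")) return 0;
        if (phone.startsWith("137")) return 1;
        if (phone.startsWith("138")) return 2;
        if (phone.startsWith("139")) return 3;
        return 4;
    }

    // Assumption: with one ReduceTask everything goes to partition 0; otherwise the
    // partition number must be smaller than the number of ReduceTasks.
    static int route(String phone, int numReduceTasks) {
        if (numReduceTasks == 1) {
            return 0;
        }
        int partition = prefixPartition(phone);
        if (partition < 0 || partition >= numReduceTasks) {
            throw new IllegalStateException("Illegal partition for " + phone + " (" + partition + ")");
        }
        return partition;
    }
}
```

With these rules, 5 ReduceTasks accept every record, 1 ReduceTask collapses everything into one output file, and 3 ReduceTasks fail as soon as a record lands in partition 3 or 4.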

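V. Input Split Mechanism

TextInputFormat split mechanism

The framework uses the TextInputFormat split mechanism by default, which plans splits file by file. However small a file is, it becomes its own split and is handed to its own MapTask. With a large number of small files this produces a large number of MapTasks, and processing becomes extremely inefficient.

The core code is as follows: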

org.apache.hadoop.mapreduce.lib.input.FileInputFormat
	public List<InputSplit> getSplits(JobContext job) throws IOException {
 StopWatch sw = new StopWatch().start();
 // The default minSize is 1
 long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
 // The default maxSize is Long.MAX_VALUE
 long maxSize = getMaxSplitSize(job);
 // Core split-generation code
 List<InputSplit> splits = new ArrayList<InputSplit>();
 List<FileStatus> files = listStatus(job);
 // Iterate over every input file
 for (FileStatus file: files) {
   Path path = file.getPath();
   long length = file.getLen();
   if (length != 0) {
     // blkLocations holds the location information of each block of the file
     BlockLocation[] blkLocations;
     if (file instanceof LocatedFileStatus) {
       blkLocations = ((LocatedFileStatus) file).getBlockLocations();
     } else {
       FileSystem fs = path.getFileSystem(job.getConfiguration());
       blkLocations = fs.getFileBlockLocations(file, 0, length);
     }
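     // When the file can be split
     if (isSplitable(job, path)) {
       // Get the block size of the file (128 MB by default)
       long blockSize = file.getBlockSize();
       // Derive the split size from the block size, minSize and maxSize (128 MB by default)
       long splitSize = computeSplitSize(blockSize, minSize, maxSize);
       long bytesRemaining = length;
       // While the remaining bytes exceed 1.1 (SPLIT_SLOP) times the split size, cut off one full-size split
       while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
         // Find the index of the block that contains the start of this split
         int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
         // A split records: the file path, its start offset, its length, and the hosts that store the block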
         splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                     blkLocations[blkIndex].getHosts(),
                     blkLocations[blkIndex].getCachedHosts()));
         bytesRemaining -= splitSize;
       }
       // Whatever remains (at most 1.1 times the split size) becomes the final split
       if (bytesRemaining != 0) {
         int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
         splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                    blkLocations[blkIndex].getHosts(),
                    blkLocations[blkIndex].getCachedHosts()));
       }
     } else { // not splitable
       splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                   blkLocations[0].getCachedHosts()));
     }
   } else { 
     //Create empty hosts array for zero length files
     splits.add(makeSplit(path, 0, length, new String[0]));
   }
 }
 // Save the number of input files for metrics/loadgen
 job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
 sw.stop();
 if (LOG.isDebugEnabled()) {
   LOG.debug("Total # of splits generated by getSplits: " + splits.size()
          + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
    }
    return splits;
  }
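
The effect of the SPLIT_SLOP loop is easiest to see with numbers. The following JDK-only sketch repeats just the loop for one splittable, non-empty file; `SplitPlanner` and `splitLengths` are hypothetical names, and block locations and hosts are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone version of the split loop in getSplits(), for one splittable, non-empty file.
final class SplitPlanner {
    static final double SPLIT_SLOP = 1.1;

    // Returns the lengths of the splits that the loop would create.
    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long bytesRemaining = fileLength;
        // Cut full-size splits while more than 1.1 * splitSize bytes remain.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            bytesRemaining -= splitSize;
        }
        // The remainder, if any, becomes the last split.
        if (bytesRemaining != 0) {
            lengths.add(bytesRemaining);
        }
        return lengths;
    }
}
```

With a 128 MB split size, a 200 MB file gives two splits (128 MB and 72 MB), while a 140 MB file stays in a single 140 MB split because 140 / 128 is below 1.1.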

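CombineTextInputFormat split mechanism

1. CombineTextInputFormat is meant for inputs with too many small files. It logically groups several small files into one split, so that they can all be handled by a single MapTask.

2. Setting the maximum size of a virtual-storage split

CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB

Note: the maximum is best chosen according to the actual sizes of the small files.

3. How CombineTextInputFormat generates splits

Split generation has two stages: virtual storage and split assembly.

Virtual storage: every file in the input directory is compared in turn with the value set by setMaxInputSplitSize. If the file is no larger than the maximum, it becomes one logical block. If it is at least twice the maximum, a block of the maximum size is cut off. Once what remains is larger than the maximum but smaller than twice the maximum, the remainder is divided evenly into two virtual blocks.

Split assembly: each virtual block is checked against the setMaxInputSplitSize value. If it is greater than or equal to the maximum, it forms a split on its own; otherwise it is merged with the following virtual blocks until the accumulated size reaches the maximum, and together they form one split.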
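
The two stages can be sketched in plain Java. This is a simplified illustration, not the library's implementation: `CombinePlanner`, `virtualBlocks` and `planSplits` are hypothetical names, every file is assumed to be a single block on a single node, blocks are combined in input order, and node and rack locality are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the two CombineTextInputFormat stages.
final class CombinePlanner {

    // Stage 1 (virtual storage): cut one file into virtual blocks.
    static List<Long> virtualBlocks(long fileLength, long maxSize) {
        List<Long> blocks = new ArrayList<>();
        long left = fileLength;
        while (left > 0) {
            long length;
            if (maxSize == 0) {
                length = left;                      // no limit configured
            } else if (left > maxSize && left < 2 * maxSize) {
                length = left / 2;                  // split the remainder into two halves
            } else {
                length = Math.min(maxSize, left);   // full-size block, or the small remainder
            }
            blocks.add(length);
            left -= length;
        }
        return blocks;
    }

    // Stage 2 (split assembly): accumulate virtual blocks until the size reaches maxSize.
    static List<Long> planSplits(List<Long> fileLengths, long maxSize) {
        List<Long> splits = new ArrayList<>();
        long current = 0;
        for (long fileLength : fileLengths) {
            for (long block : virtualBlocks(fileLength, maxSize)) {
                current += block;
                if (maxSize != 0 && current >= maxSize) {
                    splits.add(current);
                    current = 0;
                }
            }
        }
        if (current > 0) {
            splits.add(current);                    // leftover blocks form the last split
        }
        return splits;
    }
}
```

For example, with a maximum of 4096 and files of size 1700, 5100, 3400 and 6800 (same unit), the virtual blocks are 1700, 2550, 2550, 3400, 3400, 3400, and they are assembled into three splits of size 4250, 5950 and 6800.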


Source analysis of the virtual storage stage: virtual storage relies on the static inner class OneFileInfo

OneFileInfo(FileStatus stat, Configuration conf,
             boolean isSplitable,
             HashMap<String, List<OneBlockInfo>> rackToBlocks,
             HashMap<OneBlockInfo, String[]> blockToNodes,
             HashMap<String, Set<OneBlockInfo>> nodeToBlocks,
             HashMap<String, Set<String>> rackToNodes,
             long maxSize)
             throws IOException {
   this.fileSize = 0;

  // Get the storage locations of the file's blocks from the file system
   BlockLocation[] locations;
   if (stat instanceof LocatedFileStatus) {
     locations = ((LocatedFileStatus) stat).getBlockLocations();
   } else {
     FileSystem fs = stat.getPath().getFileSystem(conf);
     locations = fs.getFileBlockLocations(stat, 0, stat.getLen());
   }
   // create a list of all block and their locations
   if (locations == null) {
     blocks = new OneBlockInfo[0];
   } else {

     if(locations.length == 0 && !stat.isDirectory()) {
       locations = new BlockLocation[] { new BlockLocation() };
     }

     if (!isSplitable) {
       blocks = new OneBlockInfo[1];
       fileSize = stat.getLen();
       blocks[0] = new OneBlockInfo(stat.getPath(), 0, fileSize,
           locations[0].getHosts(), locations[0].getTopologyPaths());
     } else {
       ArrayList<OneBlockInfo> blocksList = new ArrayList<OneBlockInfo>(
           locations.length);
       // Core code of the virtual storage stage
       for (int i = 0; i < locations.length; i++) {
         fileSize += locations[i].getLength();
         long left = locations[i].getLength();
         long myOffset = locations[i].getOffset();
         long myLength = 0;
         do {
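           // If maxSize is 0, the virtual block takes everything that is left of the block
           if (maxSize == 0) {
             myLength = left;
           } else {
              // Case 1: the remainder is larger than maxSize but smaller than 2 * maxSize
             if (left > maxSize && left < 2 * maxSize) {
               // the virtual block gets half of the remainder
               myLength = left / 2;
             } else {
               // Case 2:
               // 2.1 if left is at least 2 * maxSize, the virtual block gets maxSize bytes
               // 2.2 if left is no larger than maxSize, the virtual block gets all of left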
               myLength = Math.min(maxSize, left);
             }
           }
           OneBlockInfo oneblock = new OneBlockInfo(stat.getPath(),
               myOffset, myLength, locations[i].getHosts(),
               locations[i].getTopologyPaths());
           left -= myLength;
           myOffset += myLength;

           blocksList.add(oneblock);
         } while (left > 0);
       }
       blocks = blocksList.toArray(new OneBlockInfo[blocksList.size()]);
     }

     populateBlockInfo(blocks, rackToBlocks, blockToNodes, 
                       nodeToBlocks, rackToNodes);
   }
 }

Core code of the split assembly stage:

void createSplits(Map<String, Set<OneBlockInfo>> nodeToBlocks,
                  Map<OneBlockInfo, String[]> blockToNodes,
                  Map<String, List<OneBlockInfo>> rackToBlocks,
                  long totLength,
                  long maxSize,
                  long minSizeNode,
                  long minSizeRack,
                  List<InputSplit> splits                     
                 ) {
 ArrayList<OneBlockInfo> validBlocks = new ArrayList<OneBlockInfo>();
 // Size of the split currently being built
 long curSplitSize = 0;
 // nodeToBlocks maps every node to the set of blocks stored on it
 int totalNodes = nodeToBlocks.size();
 // Total length of the input
 long totalLength = totLength;
 Multiset<String> splitsPerNode = HashMultiset.create();
 Set<String> completedNodes = new HashSet<String>();

 while(true) {
   // Iterate over the logical blocks node by node
   for (Iterator<Map.Entry<String, Set<OneBlockInfo>>> iter = nodeToBlocks
       .entrySet().iterator(); iter.hasNext();) {
     Map.Entry<String, Set<OneBlockInfo>> one = iter.next();
     // Get the node identifier
     String node = one.getKey();
     // Skip this node if it has already been processed
     if (completedNodes.contains(node)) {
       continue;
     }
       // Get all blocks stored on the current node
     Set<OneBlockInfo> blocksInCurrentNode = one.getValue();
     Iterator<OneBlockInfo> oneBlockIter = blocksInCurrentNode.iterator();
     while (oneBlockIter.hasNext()) {
       OneBlockInfo oneblock = oneBlockIter.next();
       // Remove blocks that may already have been assigned to another split
       if(!blockToNodes.containsKey(oneblock)) {
         oneBlockIter.remove();
         continue;
       }

       validBlocks.add(oneblock);
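       // Remove the logical block that has just been assigned
       blockToNodes.remove(oneblock);
       // Add the length of the logical block to the current split size
       curSplitSize += oneblock.length;
       // Once the accumulated size reaches maxSize, create a split
       if (maxSize != 0 && curSplitSize >= maxSize) {
         // Create the split
         addCreatedSplit(splits, Collections.singleton(node), validBlocks);
         // Reset the bookkeeping for the next split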
         totalLength -= curSplitSize;
         curSplitSize = 0;

         splitsPerNode.add(node);

         blocksInCurrentNode.removeAll(validBlocks);
         validBlocks.clear();
         break;
       }

     }
       .......
   }

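VI. Input Formats

The input files of a MapReduce program can come in several forms, including line-based log files, binary files and database tables.

There are six commonly used FileInputFormat implementations: TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, SequenceFileInputFormat, and custom InputFormat classes.

1. TextInputFormat: the default FileInputFormat implementation. It reads the input line by line. The key is the byte offset of the start of the line within the file (type LongWritable); the value is the content of the line (type Text).

2. KeyValueTextInputFormat: every line is one record, and each record is split into a key and a value by a separator.
// In the configuration, set the separator between key and value
conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR," ");
// In the Job, set the input format
job.setInputFormatClass(KeyValueTextInputFormat.class);
3. NLineInputFormat: with NLineInputFormat, the InputSplit handled by each map task is no longer determined by the block size but by the number of lines N configured for NLineInputFormat. The number of splits is the total number of input lines divided by N; if the division leaves a remainder, the number of splits is the quotient plus 1.

4. CombineFileInputFormat: the TextInputFormat split mechanism plans splits file by file, so every file, however small, becomes a separate split handled by its own MapTask. A large number of small files therefore produces a large number of MapTasks and very poor efficiency.

CombineFileInputFormat is meant for inputs with too many small files. It logically groups several small files into one split, so that they are handled by a single MapTask.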
// Set the InputFormat
job.setInputFormatClass(CombineTextInputFormat.class);
// Set the maximum and minimum sizes of a virtual split
CombineTextInputFormat.setMaxInputSplitSize(job,1024*1024*4);
CombineTextInputFormat.setMinInputSplitSize(job,1024*1024*1);
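5. SequenceFileInputFormat: can only be used to process files in SequenceFile format.

A SequenceFile is a flat file format that Hadoop designed for storing binary [key, value] pairs.

The key and value in a SequenceFile can be any Writable type, including custom Writable types.

Structurally, a SequenceFile consists of a header followed by multiple records. The header mainly contains the class name of the key, the class name of the value, the compression codec, user-defined metadata, and sync markers (used to locate record boundaries quickly).

Each record is stored as a key-value pair and can be parsed into: record length, key length, key and value. The structure of the value depends on whether the record is compressed.

There are three compression types:

A. No compression: if compression is not enabled, each record consists of its record length, key length, key and value.

B. Record compression: almost identical to the uncompressed format, except that the value bytes are compressed with the codec defined in the header. Note that keys are not compressed. The figure below shows record compression:

[Figure: record compression layout]

C. Block compression: compresses multiple records at once, so it is more compact than record compression and is generally preferred.

Advantages of SequenceFile:
A. Supports record-based and block-based data compression.
B. Splittable, so it can serve as input splits for MapReduce.
C. Easy to modify: only the business logic needs to change, without worrying about the storage format.

Disadvantage of SequenceFile:
A. It requires a step to merge files, and the merged file is not convenient to inspect.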

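6. Custom InputFormat, using the merging of many small files as an example

Steps:

A. Define a class that extends FileInputFormat

Override isSplitable() to return false, marking files as not splittable.

Override createRecordReader() to create and initialize a custom RecordReader object.

B. Write a custom RecordReader that reads one whole file at a time and wraps it as a key-value pair

Use an I/O stream to read the whole file into the value in one go. Because splitting is disabled, the entire content of each file ends up in a single value.

Use the file path plus file name as the key.

C. Configure the Driver
// Set the input format
job.setInputFormatClass(WholeFileInputformat.class);

// Set the output format
job.setOutputFormatClass(SequenceFileOutputFormat.class);

// Custom InputFormat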
public class WholeFileInputformat extends FileInputFormat<Text, BytesWritable>{
	
	@Override
	protected boolean isSplitable(JobContext context, Path filename) {
		return false;
	}

	@Override
	public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context)	throws IOException, InterruptedException {
		
		WholeRecordReader recordReader = new WholeRecordReader();
		recordReader.initialize(split, context);
		
		return recordReader;
	}
}
// Custom RecordReader

public class WholeRecordReader extends RecordReader<Text, BytesWritable>{

	private Configuration configuration;
	private FileSplit split;
	
	private boolean isProgress= true;
	private BytesWritable value = new BytesWritable();
	private Text k = new Text();

	@Override
	public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
		
		this.split = (FileSplit)split;
		configuration = context.getConfiguration();
	}

	@Override
	public boolean nextKeyValue() throws IOException, InterruptedException {
		
		if (isProgress) {

			// 1 Allocate a buffer as large as the file
			byte[] contents = new byte[(int)split.getLength()];
			
			FileSystem fs = null;
			FSDataInputStream fis = null;
			
			try {
				// 2 Get the file system
				Path path = split.getPath();
				fs = path.getFileSystem(configuration);
				
				// 3 Open the file
				fis = fs.open(path);
				
				// 4 Read the whole file into the buffer
				IOUtils.readFully(fis, contents, 0, contents.length);
				
				// 5 Store the file content in the value
				value.set(contents, 0, contents.length);

				// 6 Get the file path and name
				String name = split.getPath().toString();

				// 7 Use it as the output key
				k.set(name);

			} finally {
				// Read errors propagate to the framework instead of being silently swallowed
				IOUtils.closeStream(fis);
			}
			
			isProgress = false;
			
			return true;
		}
		
		return false;
	}

	@Override
	public Text getCurrentKey() throws IOException, InterruptedException {
		return k;
	}

	@Override
	public BytesWritable getCurrentValue() throws IOException, InterruptedException {
		return value;
	}

	@Override
	public float getProgress() throws IOException, InterruptedException {
		// 0 until the single record has been read, 1 afterwards
		return isProgress ? 0f : 1f;
	}

	@Override
	public void close() throws IOException {
	}
}

VII. MapReduce Job Submission Flow

Step 1: call waitForCompletion(true) in the Driver class

boolean b = job.waitForCompletion(true);

Step 2: waitForCompletion(true) belongs to the class org.apache.hadoop.mapreduce.Job

The relevant source is as follows:

1. A job has two states, DEFINE and RUNNING, declared in an enum

public static enum JobState {DEFINE, RUNNING}

2. If the job is in the DEFINE state, submit() is called

public boolean waitForCompletion(boolean verbose
                              ) throws IOException, InterruptedException,
                                       ClassNotFoundException {
if (state == JobState.DEFINE) {
 // If the job is still in the DEFINE state, call submit()
 submit();
}
if (verbose) {
 monitorAndPrintJob();
} else {
 // get the completion poll interval from the client.
 int completionPollIntervalMillis = 
   Job.getCompletionPollInterval(cluster.getConf());
 while (!isComplete()) {
   try {
     Thread.sleep(completionPollIntervalMillis);
   } catch (InterruptedException ie) {
   }
 }
}
return isSuccessful();
}

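Step 3:

Class containing submit(): org.apache.hadoop.mapreduce.Job

The submit() method does the following:

1. It first calls ensureState(JobState.DEFINE) to make sure the job is in the DEFINE state.

2. It then calls setUseNewAPI() to select the new API.

3. Next it calls connect(), which checks whether cluster is null and, if so, creates a new Cluster object.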

private Cluster cluster;
private synchronized void connect()
     throws IOException, InterruptedException, ClassNotFoundException {
if (cluster == null) {
 cluster = 
   ugi.doAs(new PrivilegedExceptionAction<Cluster>() {
              public Cluster run()
                     throws IOException, InterruptedException, 
                            ClassNotFoundException {
                return new Cluster(getConfiguration());
              }
            });
}
}

4. It then calls getJobSubmitter() to obtain the job submitter; the arguments are the cluster's file system and the cluster's client.

final JobSubmitter submitter = 
     getJobSubmitter(cluster.getFileSystem(), cluster.getClient());

5. Finally it calls submitJobInternal() on the job submitter to submit the job to the cluster.

status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
   public JobStatus run() throws IOException, InterruptedException, 
   ClassNotFoundException {
     return submitter.submitJobInternal(Job.this, cluster);
   }
 });

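Step 4

Class containing submitJobInternal(): org.apache.hadoop.mapreduce.JobSubmitter

Submission details

1. It first calls checkSpecs(job) to validate the output path.

2. Depending on the configuration, it calls addMRFrameworkToDistributedCache(conf) to add the MR framework to the distributed cache.

3. It then obtains the staging directory for job submission.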

Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);

4. Obtain the job ID

JobID jobId = submitClient.getNewJobID();

5. Build the job's final submission directory from the staging directory and the job ID

Path submitJobDir = new Path(jobStagingArea, jobId.toString());

6. Call copyAndConfigureFiles(job, submitJobDir) to upload the job's files and configuration

7. Compute the input splits and obtain their number

int maps = writeSplits(job, submitJobDir);

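Step 5

Class containing writeSplits(): org.apache.hadoop.mapreduce.JobSubmitter

1. Based on the job configuration, this method checks whether the new API is in use and calls the corresponding split method.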

private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
   Path jobSubmitDir) throws IOException,
   InterruptedException, ClassNotFoundException {
 JobConf jConf = (JobConf)job.getConfiguration();
 int maps;
 if (jConf.getUseNewMapper()) {
   maps = writeNewSplits(job, jobSubmitDir);
 } else {
   maps = writeOldSplits(jConf, jobSubmitDir);
 }
 return maps;
}

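Step 6

Class containing writeNewSplits(): org.apache.hadoop.mapreduce.JobSubmitter

1. First, the input format class is instantiated through reflection

InputFormat<?, ?> input =
   ReflectionUtils.newInstance(job.getInputFormatClass(), conf);

2. The input format then computes the splits and returns them as a list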

List<InputSplit> splits = input.getSplits(job);

private <T extends InputSplit>
int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
   InterruptedException, ClassNotFoundException {
 Configuration conf = job.getConfiguration();
 InputFormat<?, ?> input =
   ReflectionUtils.newInstance(job.getInputFormatClass(), conf);

 List<InputSplit> splits = input.getSplits(job);
 T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);

 // sort the splits into order based on size, so that the biggest
 // go first
 Arrays.sort(array, new SplitComparator());
 JobSplitWriter.createSplitFiles(jobSubmitDir, conf, 
     jobSubmitDir.getFileSystem(conf), array);
 return array.length;
}

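Step 7

getSplits() is implemented by default in org.apache.hadoop.mapreduce.lib.input.FileInputFormat

1. It first calls getFormatMinSplitSize() to get the minimum split size of the input format, which defaults to 1

protected long getFormatMinSplitSize() {return 1;}

2. It also calls getMinSplitSize(job) to get the job's minimum split size; if SPLIT_MINSIZE is not set, the default is 1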

public static final String SPLIT_MINSIZE = "mapreduce.input.fileinputformat.split.minsize";

public static long getMinSplitSize(JobContext job) {
return job.getConfiguration().getLong(SPLIT_MINSIZE, 1L);
}

3. It then calls getMaxSplitSize(job) to get the job's maximum split size; if SPLIT_MAXSIZE is not set, it returns Long.MAX_VALUE

public static long getMaxSplitSize(JobContext context) {
 return context.getConfiguration().getLong(SPLIT_MAXSIZE, Long.MAX_VALUE);
}

4. Get all input files of the job and iterate over them

List<FileStatus> files = listStatus(job);
  for (FileStatus file: files)

5. For each file, first get the locations of all its blocks

BlockLocation[] blkLocations;
if (file instanceof LocatedFileStatus) {
   blkLocations = ((LocatedFileStatus) file).getBlockLocations();
 } else {
    FileSystem fs = path.getFileSystem(job.getConfiguration());
    blkLocations = fs.getFileBlockLocations(file, 0, length);
  }

6. Then get the block size

long blockSize = file.getBlockSize();

7. Call computeSplitSize(blockSize, minSize, maxSize) to get the split size

long splitSize = computeSplitSize(blockSize, minSize, maxSize);

protected long computeSplitSize(long blockSize, long minSize,long maxSize) {
return Math.max(minSize, Math.min(maxSize, blockSize));
}

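8. Once the split size is known, the file is divided according to the file length and the split size. A full-size split is cut off only while the remaining length is more than 1.1 times the split size.

8.1 First find the index of the block that contains the start of the split

protected int getBlockIndex(BlockLocation[] blkLocations, 
                         long offset) {
for (int i = 0 ; i < blkLocations.length; i++) {
 // The split starts in block i if its offset lies in [block start, block start + block length)
 if ((blkLocations[i].getOffset() <= offset) &&
     (offset < blkLocations[i].getOffset() + blkLocations[i].getLength())){
   return i;
 }
}
 // the remainder of the method (reached only when no block contains the offset) is not shown in this excerpt
}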

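8.2 Then add the split information

It mainly consists of the file path, the start offset of the split, the length of the split, and the hosts storing the block that contains the split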

while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
         int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
         splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                     blkLocations[blkIndex].getHosts(),
                     blkLocations[blkIndex].getCachedHosts()));
         bytesRemaining -= splitSize;
       }

       if (bytesRemaining != 0) {
         int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
         splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                    blkLocations[blkIndex].getHosts(),
                    blkLocations[blkIndex].getCachedHosts()));
       }
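
Two pieces of step 7 lend themselves to a quick experiment: how the minimum and maximum settings move the split size, and how a split offset is mapped to a block. The sketch below is JDK-only; `SplitSizing` and `blockIndex` are hypothetical names, and plain arrays of offsets and lengths stand in for BlockLocation objects.

```java
// Hypothetical JDK-only helpers that repeat the arithmetic used while planning splits.
final class SplitSizing {

    // Same formula as computeSplitSize: clamp the block size between minSize and maxSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Finds the block whose byte range [offset, offset + length) contains the given file offset.
    static int blockIndex(long[] blockOffsets, long[] blockLengths, long offset) {
        for (int i = 0; i < blockOffsets.length; i++) {
            if (blockOffsets[i] <= offset && offset < blockOffsets[i] + blockLengths[i]) {
                return i;
            }
        }
        throw new IllegalArgumentException("Offset " + offset + " is outside of the file");
    }
}
```

With a 128 MB block and the defaults (minimum 1, maximum Long.MAX_VALUE) the split size equals the block size. Lowering the maximum to 64 MB gives 64 MB splits, and raising the minimum to 256 MB gives 256 MB splits. In other words, the maximum only matters when it is below the block size, and the minimum only when it is above it.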

VIII. Detailed MapReduce Workflow

(1) Map stage

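1. Suppose there is a 200 MB file to process.

2. Splitting: before the job reaches the cluster (on the client, as part of the submit() call), the input is divided into splits according to the configuration. With the default 128 MB split size, the 200 MB file yields two splits of 128 MB and 72 MB.

3. Submission: once the splits are planned, the job can be submitted, either to the local runner or to YARN. The local runner only needs the split information and the XML configuration; YARN additionally needs the job's jar.

4. Starting MapTasks: the number of MapTasks is derived from the number of splits, and the MapTasks run in parallel.

5. Each MapTask runs the map() method of the Mapper class. map() takes a key and a value as input, so these must be obtained first:
1. The driver specifies the input format class with setInputFormatClass().
2. The key-value pairs are read from the input files by the InputFormat; the default implementation is TextInputFormat.
3. TextInputFormat's createRecordReader() method creates a RecordReader object.
4. The RecordReader's nextKeyValue() method reads the key-value data, which is passed to map().
6. After applying its logic, the map() method of the Mapper class writes its output with context.write(k, v).

7. If map() wrote its output directly to the reducers, that would amount to operating on disk for every record and would cause a very large number of I/O operations, which is far too slow. That is why a Shuffle process sits between Map and Reduce.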

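(2) Shuffle stage

1. After the map() method has processed a record, the collect() method of the MapOutputCollector object in the MapTask class is called; it writes the data collected by the MapTask into the circular buffer. The arguments of collect() are the key, the value and the partition.

The default implementation of MapOutputCollector is MapOutputBuffer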
public void write(K key, V value) throws IOException, InterruptedException {
collector.collect(key, value,
              partitioner.getPartition(key, value, partitions));
}
// By default getPartition() of the HashPartitioner class is called
public int getPartition(K key, V value,
                  int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
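2. The circular buffer is simply an array made up of two parts: one holds the metadata of the records, the other holds the actual key-value data. Its default size is 100 MB. When the buffer is 80% full, a spill to disk begins and new records are written in the reverse direction. This allows the MapTask to keep receiving data while data is being written to disk.

3. Before spilling, the buffered data is arranged by partition (as computed by the partitioner) and sorted by key (quicksort).

4. After partitioning and sorting, the data is spilled to disk. Several spills may happen during this period, producing several spill files.

5. When all spills are done, the spill files are merge-sorted into one large file in which the data of each partition is sorted. Producing this file marks the end of the Shuffle process.

6. After spilling and before the merge sort, an optional combine step may run. It pre-aggregates the output of each MapTask locally to reduce the amount of data sent over the network.

7. Note: the size of the shuffle buffer affects the performance of a MapReduce program. In principle, the larger the buffer, the fewer disk I/O operations and the faster the job. The size can be tuned with the parameter io.sort.mb (mapreduce.task.io.sort.mb in newer releases), which defaults to 100 MB.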
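
The ordering applied before a spill (partition first, then key) can be sketched in a few lines. This is an illustration only: `SpillSortDemo`, `Rec` and `sortForSpill` are hypothetical names, records are kept as plain objects rather than bytes in a circular buffer, and the JDK's list sort is used in place of the framework's quicksort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of the ordering applied to buffered map output before it is spilled.
final class SpillSortDemo {

    // One buffered map output record together with the partition computed at collect() time.
    static final class Rec {
        final int partition;
        final String key;
        final int value;

        Rec(int partition, String key, int value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return partition + ":" + key + "=" + value;
        }
    }

    // Sorts by partition first and by key within each partition, leaving the input list untouched.
    static List<Rec> sortForSpill(List<Rec> buffered) {
        List<Rec> sorted = new ArrayList<>(buffered);
        sorted.sort(Comparator.comparingInt((Rec r) -> r.partition)
                              .thenComparing((Rec r) -> r.key));
        return sorted;
    }
}
```

Because each partition's records end up contiguous and sorted, a spill file can be written partition by partition, and the later merge of spill files only has to merge runs that are already sorted.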

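(3) Reduce stage

1. Producing the merged file marks the end of the Shuffle process; the job then enters the Reduce stage.

2. First the required number of ReduceTasks is started (the number of ReduceTasks matches the configured number of partitions), and each ReduceTask is told which range of data it is responsible for.

3. Copy phase: each ReduceTask remotely copies its share of data from the output of every MapTask. If a piece of data exceeds a certain size threshold it is written straight to disk; otherwise it is kept in memory.

4. Merge phase: while copying is in progress, the ReduceTask runs two threads that merge the files in memory and on disk, to keep memory usage and the number of disk files under control.

5. Sort phase: by the semantics of MapReduce, the input of the user's reduce() method is a group of data aggregated by key. Because every MapTask has already sorted its own output locally, the ReduceTask only needs to run one merge sort over the data (a partition may contain several keys; if it contains only one key, no sorting is needed).

6. Reduce phase: the reduce() method applies the business logic and writes the result to HDFS.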
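
The merge and grouping performed on the reduce side can be sketched as a k-way merge over runs that are already sorted by key. This is a JDK-only illustration: `MergeGroupDemo`, `kv` and `mergeAndGroup` are hypothetical names, and in-memory lists stand in for the map output segments a ReduceTask has copied.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical model of the reduce-side merge: several sorted runs in, one key-grouped sequence out.
final class MergeGroupDemo {

    // Small helper to build a key-value pair.
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Merges runs that are each sorted by key, and groups the values of equal keys together.
    static Map<String, List<Integer>> mergeAndGroup(List<List<Map.Entry<String, Integer>>> sortedRuns) {
        // Each heap element is a cursor {runIndex, position}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing((int[] c) -> sortedRuns.get(c[0]).get(c[1]).getKey()));
        for (int i = 0; i < sortedRuns.size(); i++) {
            if (!sortedRuns.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        // Keys arrive in sorted order, so insertion order of the map is key order.
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();
            Map.Entry<String, Integer> record = sortedRuns.get(cursor[0]).get(cursor[1]);
            grouped.computeIfAbsent(record.getKey(), k -> new ArrayList<>()).add(record.getValue());
            if (cursor[1] + 1 < sortedRuns.get(cursor[0]).size()) {
                heap.add(new int[] {cursor[0], cursor[1] + 1});
            }
        }
        return grouped;
    }
}
```

Each entry of the returned map corresponds to one call of reduce(): a key together with all the values that the different map outputs produced for it.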

IX. MapTask Source Code Analysis

1. The Mapper class has four main methods: setup(), map(), cleanup() and run()

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
public abstract class Context
 implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
}
//1. setup(): called once at the start of the MapTask
protected void setup(Context context
                    ) throws IOException, InterruptedException {
}

//2. map(): called once for every key-value pair in the input split;
// applications are expected to override it
@SuppressWarnings("unchecked")
protected void map(KEYIN key, VALUEIN value, 
                  Context context) throws IOException, InterruptedException {
 context.write((KEYOUT) key, (VALUEOUT) value);
}

// 3. cleanup(): called once at the end of the MapTask
protected void cleanup(Context context
                      ) throws IOException, InterruptedException {
}

//4. run(): calls the three methods above in a fixed order
// run() is ultimately invoked from the MapTask class
public void run(Context context) throws IOException, InterruptedException {
 setup(context);
 try {
   while (context.nextKeyValue()) {
     map(context.getCurrentKey(), context.getCurrentValue(), context);
   }
 } finally {
   cleanup(context);
 }
}
}
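
The run() method above is a template method: the framework fixes the order of the calls and the subclass fills in the steps. A minimal JDK-only sketch of the same pattern follows; `MiniMapper` and `UpperCaseMapper` are hypothetical classes, and a plain list takes the place of the Context object.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical miniature of the Mapper life cycle: setup once, map per record, cleanup once.
abstract class MiniMapper<IN, OUT> {

    protected void setup(List<OUT> out) { }

    protected abstract void map(IN record, List<OUT> out);

    protected void cleanup(List<OUT> out) { }

    // Fixed call order, mirroring Mapper.run(): cleanup runs even if map() throws.
    public final List<OUT> run(Iterator<IN> records) {
        List<OUT> out = new ArrayList<>();
        setup(out);
        try {
            while (records.hasNext()) {
                map(records.next(), out);
            }
        } finally {
            cleanup(out);
        }
        return out;
    }
}

// Example subclass: emits a marker in setup and cleanup, and upper-cases every record in between.
final class UpperCaseMapper extends MiniMapper<String, String> {

    @Override
    protected void setup(List<String> out) {
        out.add("BEGIN");
    }

    @Override
    protected void map(String record, List<String> out) {
        out.add(record.toUpperCase());
    }

    @Override
    protected void cleanup(List<String> out) {
        out.add("END");
    }
}
```

Running UpperCaseMapper over the records "a" and "b" yields BEGIN, A, B, END, which shows where per-task initialization and finalization code belongs.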

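2. The MapTask class: org.apache.hadoop.mapred.MapTask

Step 1: the run() method

1. MapTask first checks whether the job has any reduce tasks. If there are none, the task consists of the map phase only, and the job is finished when the map phase ends. If there are reduce tasks, the task is divided into a map phase and a sort phase: the map phase accounts for 66.7% of the task's progress and the sort phase for the remaining 33.3%.

2. It then calls runNewMapper(job, splitMetaInfo, umbilical, reporter)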

@Override
public void run(final JobConf job, final TaskUmbilicalProtocol umbilical)
 throws IOException, ClassNotFoundException, InterruptedException {
 this.umbilical = umbilical;

 if (isMapTask()) {
   if (conf.getNumReduceTasks() == 0) {
    // No reduce tasks: the map phase is the whole task, and the job can finish when it ends
     mapPhase = getProgress().addPhase("map", 1.0f);
   } else { 
  // With reduce tasks, progress is split between the map phase (67%) and the sort phase (33%)
     mapPhase = getProgress().addPhase("map", 0.667f);
     sortPhase  = getProgress().addPhase("sort", 0.333f);
   }
 }
 TaskReporter reporter = startReporter(umbilical);
 boolean useNewApi = job.getUseNewMapper();
 initialize(job, getJobID(), reporter, useNewApi);
 if (jobCleanup) {
   runJobCleanupTask(umbilical, reporter);
   return;
 }
 if (jobSetup) {
   runJobSetupTask(umbilical, reporter);
   return;
 }
 if (taskCleanup) {
   runTaskCleanupTask(umbilical, reporter);
   return;
 }
 if (useNewApi) {
  // Call runNewMapper(job, splitMetaInfo, umbilical, reporter)
   runNewMapper(job, splitMetaInfo, umbilical, reporter);
 } else {
   runOldMapper(job, splitMetaInfo, umbilical, reporter);
 }
 done(umbilical, reporter);
}

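Step 2: the runNewMapper() method

1. Build the task context taskContext from the JobConf and the task ID.

2. Use taskContext.getMapperClass() and reflection to instantiate the user-defined Mapper class.

3. Use taskContext.getInputFormatClass() and reflection to instantiate the input format class, TextInputFormat by default.

4. Rebuild the split information. The client submits a whole set of splits, but each MapTask handles only one, so the task reconstructs its own split.

5. The constructor of NewTrackingRecordReader creates a RecordReader polymorphically. Because the default InputFormat is TextInputFormat, the reader returned by default is a LineRecordReader.

6. The NewOutputCollector constructor provides a RecordWriter object.

The NewOutputCollector constructor sets up two things: a MapOutputCollector and a Partitioner.

a. If the number of partitions is 1, a Partitioner that always returns partition 0 is used. If the number of partitions is greater than 1, the Partitioner class is instantiated through reflection; the default is HashPartitioner.

b. createSortingCollector() provides the collector; by default the collector is a MapOutputBuffer, i.e. the circular buffer.

The collector's initialization method, collector.init(context):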

public void init(MapOutputCollector.Context context
                 ) throws IOException, ClassNotFoundException {
   job = context.getJobConf();
   reporter = context.getReporter();
   mapTask = context.getMapTask();
   mapOutputFile = mapTask.getMapOutputFile();
   sortPhase = mapTask.getSortPhase();
   spilledRecordsCounter = reporter.getCounter(TaskCounter.SPILLED_RECORDS);
   partitions = job.getNumReduceTasks();
   rfs = ((LocalFileSystem)FileSystem.getLocal(job)).getRaw();

   // Spill threshold: 80% by default
   final float spillper =
     job.getFloat(JobContext.MAP_SORT_SPILL_PERCENT, (float)0.8);
  // Size of the circular buffer: 100 MB by default
   final int sortmb = job.getInt(JobContext.IO_SORT_MB, 100);
   indexCacheMemoryLimit = job.getInt(JobContext.INDEX_CACHE_MEMORY_LIMIT,
                                      INDEX_CACHE_MEMORY_LIMIT_DEFAULT);
   if (spillper > (float)1.0 || spillper <= (float)0.0) {
     throw new IOException("Invalid \"" + JobContext.MAP_SORT_SPILL_PERCENT +
         "\": " + spillper);
   }
   if ((sortmb & 0x7FF) != sortmb) {
     throw new IOException(
         "Invalid \"" + JobContext.IO_SORT_MB + "\": " + sortmb);
   }
   // Sort algorithm: quicksort by default
   sorter = ReflectionUtils.newInstance(job.getClass("map.sort.class",
         QuickSort.class, IndexedSorter.class), job);
   // Size of the circular buffer in bytes
   int maxMemUsage = sortmb << 20;
   maxMemUsage -= maxMemUsage % METASIZE;
   kvbuffer = new byte[maxMemUsage];
   bufvoid = kvbuffer.length;
   kvmeta = ByteBuffer.wrap(kvbuffer)
      .order(ByteOrder.nativeOrder())
      .asIntBuffer();
   setEquator(0);
   bufstart = bufend = bufindex = equator;
   kvstart = kvend = kvindex;

   maxRec = kvmeta.capacity() / NMETA;
   softLimit = (int)(kvbuffer.length * spillper);
   bufferRemaining = softLimit;
   LOG.info(JobContext.IO_SORT_MB + ": " + sortmb);
   LOG.info("soft limit at " + softLimit);
   LOG.info("bufstart = " + bufstart + "; bufvoid = " + bufvoid);
   LOG.info("kvstart = " + kvstart + "; length = " + maxRec);

   // comparator: the sort comparator. If we defined a custom sort comparator, it is returned;
   //             otherwise the key's own comparator is returned.
   comparator = job.getOutputKeyComparator();
   keyClass = (Class<K>)job.getMapOutputKeyClass();
   valClass = (Class<V>)job.getMapOutputValueClass();
   serializationFactory = new SerializationFactory(job);
   keySerializer = serializationFactory.getSerializer(keyClass);
   keySerializer.open(bb);
   valSerializer = serializationFactory.getSerializer(valClass);
   valSerializer.open(bb);

   // output counters
   mapOutputByteCounter = reporter.getCounter(TaskCounter.MAP_OUTPUT_BYTES);
   mapOutputRecordCounter =
     reporter.getCounter(TaskCounter.MAP_OUTPUT_RECORDS);
   fileOutputByteCounter = reporter
       .getCounter(TaskCounter.MAP_OUTPUT_MATERIALIZED_BYTES);

   // compression of the map output
   if (job.getCompressMapOutput()) {
     Class<? extends CompressionCodec> codecClass =
       job.getMapOutputCompressorClass(DefaultCodec.class);
     codec = ReflectionUtils.newInstance(codecClass, job);
   } else {
     codec = null;
   }

   // combiner: map-side pre-aggregation
   final Counters.Counter combineInputCounter =
     reporter.getCounter(TaskCounter.COMBINE_INPUT_RECORDS);
   combinerRunner = CombinerRunner.create(job, getTaskID(), 
                                          combineInputCounter,
                                          reporter, null);
   if (combinerRunner != null) {
     final Counters.Counter combineOutputCounter =
       reporter.getCounter(TaskCounter.COMBINE_OUTPUT_RECORDS);
     combineCollector= new CombineOutputCollector<K,V>(combineOutputCounter, reporter, job);
   } else {
     combineCollector = null;
   }
   spillInProgress = false;
   minSpillsForCombine = job.getInt(JobContext.MAP_COMBINE_MIN_SPILLS, 3);
   // spillThread: the thread that spills the buffer to disk
   spillThread.setDaemon(true);
   spillThread.setName("SpillThread");
   spillLock.lock();
   try {
     spillThread.start();
     while (!spillThreadRunning) {
       spillDone.await();
     }
   } catch (InterruptedException e) {
     throw new IOException("Spill thread failed to initialize", e);
   } finally {
     spillLock.unlock();
   }
   if (sortSpillException != null) {
     throw new IOException("Spill thread failed to initialize",
         sortSpillException);
   }
 }
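The sizing arithmetic in init() can be checked in isolation. The sketch below reproduces only the integer and float math for the default settings (io.sort.mb = 100, spill percent = 0.8) without allocating the buffer. `SortBufferMath` and its method names are hypothetical; the constants NMETA = 4 and METASIZE = 16 follow the four metadata ints written per record.

```java
final class SortBufferMath {
    static final int NMETA = 4;            // metadata ints stored per record
    static final int METASIZE = NMETA * 4; // metadata bytes stored per record

    /** Usable buffer size in bytes for a given sort buffer size in MB. */
    static int bufferBytes(int sortMb) {
        // Same validity check as init(): only values 0..2047 survive the mask unchanged.
        if ((sortMb & 0x7FF) != sortMb) {
            throw new IllegalArgumentException("sort buffer must be between 0 and 2047 MB: " + sortMb);
        }
        int maxMemUsage = sortMb << 20;              // MB -> bytes
        return maxMemUsage - maxMemUsage % METASIZE; // round down to whole metadata slots
    }

    /** Number of bytes that may fill up before a spill is triggered. */
    static int softLimit(int bufferBytes, float spillPercent) {
        return (int) (bufferBytes * spillPercent);
    }

    /** Upper bound on records: capacity of the int view divided by ints per record. */
    static int maxRecords(int bufferBytes) {
        return (bufferBytes / 4) / NMETA;
    }

    public static void main(String[] args) {
        int bytes = bufferBytes(100);
        System.out.println("buffer bytes = " + bytes);                  // 104857600
        System.out.println("soft limit   = " + softLimit(bytes, 0.8f)); // 83886080
        System.out.println("max records  = " + maxRecords(bytes));      // 6553600
    }
}
```

With the defaults, a spill therefore starts once about 80 MB of the 100 MB buffer is occupied, while the map task keeps writing into the remaining 20 MB.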

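7. Create the map context object: mapContext = new MapContextImpl().

8. Call input.initialize(split, mapperContext) to initialize the RecordReader. For TextInputFormat the RecordReader is LineRecordReader.

9. Call the run() method of the Mapper class to run the map phase.

10. Inside run(), the RecordReader's nextKeyValue() method checks whether another record exists, assigns the current key and value, and returns a boolean.

11. Each record written by the Mapper goes through the collect() method of MapOutputBuffer, which stores it in the circular buffer together with its metadata:

partition, start index of the key, start index of the value, length of the value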

     kvmeta.put(kvindex + PARTITION, partition);
     kvmeta.put(kvindex + KEYSTART, keystart);
     kvmeta.put(kvindex + VALSTART, valstart);
     kvmeta.put(kvindex + VALLEN, distanceTo(valstart, valend));
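To make the metadata layout concrete, here is a simplified sketch that stores the four ints per record in an IntBuffer view over a byte array, as the lines above do. It is an illustration under assumptions: the slot offsets (VALSTART = 0, KEYSTART = 1, PARTITION = 2, VALLEN = 3) reflect my reading of MapOutputBuffer and should be verified against your Hadoop version, the class `KvMetaSketch` is hypothetical, and slots are filled front to back, whereas the real buffer writes metadata backwards from the equator.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

final class KvMetaSketch {
    // Assumed slot offsets inside one metadata entry (see the note above).
    static final int VALSTART = 0;
    static final int KEYSTART = 1;
    static final int PARTITION = 2;
    static final int VALLEN = 3;
    static final int NMETA = 4; // ints per metadata entry

    private final IntBuffer kvmeta;
    private int records = 0;

    KvMetaSketch(int maxRecords) {
        byte[] backing = new byte[maxRecords * NMETA * 4];
        // Same construction as in init(): an int view over the byte buffer.
        kvmeta = ByteBuffer.wrap(backing).order(ByteOrder.nativeOrder()).asIntBuffer();
    }

    /** Records where one serialized key/value pair lives and which partition it belongs to. */
    void collect(int partition, int keystart, int valstart, int valend) {
        int kvindex = records * NMETA;
        kvmeta.put(kvindex + PARTITION, partition);
        kvmeta.put(kvindex + KEYSTART, keystart);
        kvmeta.put(kvindex + VALSTART, valstart);
        kvmeta.put(kvindex + VALLEN, valend - valstart);
        records++;
    }

    /** Reads one metadata field of the given record. */
    int get(int record, int field) {
        return kvmeta.get(record * NMETA + field);
    }

    public static void main(String[] args) {
        KvMetaSketch meta = new KvMetaSketch(2);
        meta.collect(1, 0, 5, 12); // key at bytes [0,5), value at bytes [5,12), partition 1
        System.out.println("partition = " + meta.get(0, PARTITION));
        System.out.println("value length = " + meta.get(0, VALLEN));
    }
}
```

Sorting can then swap these small fixed-size entries instead of moving the serialized keys and values themselves.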

private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewMapper(final JobConf job,
                 final TaskSplitIndex splitIndex,
                 final TaskUmbilicalProtocol umbilical,
                 TaskReporter reporter
                 ) throws IOException, ClassNotFoundException,
                          InterruptedException {
 // 1. Build the task context, taskContext, from the jobConf and the task ID
 org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, 
  getTaskID(),reporter);
 // 2. Call taskContext.getMapperClass() and instantiate our own Mapper class through reflection
 org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =  (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
     ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
 // 3. Call taskContext.getInputFormatClass() and instantiate the input format class through reflection; the default is TextInputFormat
 org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
   (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
     ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
 // 4. Rebuild the split information: the client wrote all splits into one split file, and this task reads back its own split at the recorded offset
 org.apache.hadoop.mapreduce.InputSplit split = null;
 split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
     splitIndex.getStartOffset());
 LOG.info("Processing split: " + split);
 // 5. Construct a RecordReader through the NewTrackingRecordReader constructor; with the default TextInputFormat the underlying reader is LineRecordReader
 org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
   new NewTrackingRecordReader<INKEY,INVALUE>
     (split, inputFormat, reporter, taskContext);

 job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
 org.apache.hadoop.mapreduce.RecordWriter output = null;

 if (job.getNumReduceTasks() == 0) {
   output = 
     new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
 } else {
   // 6. Obtain a RecordWriter through the NewOutputCollector constructor.
   //    The constructor mainly sets up a MapOutputCollector (the collector) and a partitioner.
   output = new NewOutputCollector(taskContext, job, umbilical, reporter);
 }

 // 7. Create the map context object: mapContext = new MapContextImpl()
 org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE> 
 mapContext = 
   new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(), 
       input, output, 
       committer, 
       reporter, split);

 org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context 
     mapperContext = 
       new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
           mapContext);

 try {
   // 8. Call input.initialize(split, mapperContext) to initialize the RecordReader; for TextInputFormat it is LineRecordReader
   input.initialize(split, mapperContext);
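   // 9. Call the run() method of the Mapper class to run the map phase
   // 10. Inside run(), the RecordReader's nextKeyValue() checks whether another record exists, assigns the key and value, and returns a boolean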
   mapper.run(mapperContext);
   mapPhase.complete();
   // 11. By this point every record written by the Mapper has gone through MapOutputBuffer.collect() into the circular buffer (partition, key start, value start, value length)
   setPhase(TaskStatus.Phase.SORT);
   statusUpdate(umbilical);
   input.close();
   input = null;
   output.close(mapperContext);
   output = null;
 } finally {
   closeQuietly(input);
   closeQuietly(output, mapperContext);
 }
}

10. ReduceTask Source Code Analysis

1. The Reducer class: it has four main methods: setup(), reduce(), cleanup(), and run()

public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {

public abstract class Context 
implements ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
}

// setup() is called once when the ReduceTask starts
protected void setup(Context context
                 ) throws IOException, InterruptedException {
}

// reduce() is called once for each group of values that share a key; this is the method you normally override
@SuppressWarnings("unchecked")
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
                  ) throws IOException, InterruptedException {
for(VALUEIN value: values) {
context.write((KEYOUT) key, (VALUEOUT) value);
}
}

// cleanup() is called once when the ReduceTask finishes
protected void cleanup(Context context
                   ) throws IOException, InterruptedException {
// NOTHING
}

// run() drives the three methods above: setup(), then reduce() per key, then cleanup()
public void run(Context context) throws IOException, InterruptedException {
setup(context);
try {
while (context.nextKey()) {
  reduce(context.getCurrentKey(), context.getValues(), context);
  Iterator<VALUEIN> iter = context.getValues().iterator();
  if(iter instanceof ReduceContext.ValueIterator) {
    ((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();        
  }
}
} finally {
cleanup(context);
}
}
}

Note: the difference between the run() methods of the Mapper and Reducer classes

map    : while (context.nextKeyValue())   map() is called once per record
reduce : while (context.nextKey())        reduce() is called once per group of records
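The difference in call counts can be seen with a small stand-alone simulation. This is only a sketch in plain Java, not Hadoop code: `RunLoopSketch`, `entry`, and `groupSorted` are hypothetical names, and a sorted list of (key, value) pairs stands in for the merged map output that a reducer receives.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class RunLoopSketch {
    /** Convenience factory for one (key, value) record. */
    static Map.Entry<String, Integer> entry(String key, int value) {
        return new SimpleEntry<String, Integer>(key, value);
    }

    /** Groups consecutive records that share a key, the way nextKey() advances one group at a time. */
    static List<Map.Entry<String, List<Integer>>> groupSorted(List<Map.Entry<String, Integer>> sorted) {
        List<Map.Entry<String, List<Integer>>> groups = new ArrayList<>();
        for (Map.Entry<String, Integer> record : sorted) {
            boolean sameKey = !groups.isEmpty()
                    && groups.get(groups.size() - 1).getKey().equals(record.getKey());
            if (!sameKey) {
                // A new key starts a new group, i.e. one more reduce() call.
                groups.add(new SimpleEntry<String, List<Integer>>(record.getKey(), new ArrayList<Integer>()));
            }
            groups.get(groups.size() - 1).getValue().add(record.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> sorted = Arrays.asList(
                entry("a", 1), entry("a", 1), entry("b", 1), entry("c", 1), entry("c", 1));
        System.out.println("map() calls:    " + sorted.size());              // one per record
        System.out.println("reduce() calls: " + groupSorted(sorted).size()); // one per key group
    }
}
```

For the five sample records, map() would run five times while reduce() would run three times, once each for the keys a, b, and c.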

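2. The ReduceTask class: org.apache.hadoop.mapred.ReduceTask

Step 1: the run() method

A. RawKeyValueIterator rIter = shuffleConsumerPlugin.run()

The reducer pulls the data that belongs to its partition and wraps it in an iterator. This is a real iterator over the raw key/value data.

B. RawComparator comparator = job.getOutputValueGroupingComparator();

The reducer obtains the grouping comparator:

If the user defined a grouping comparator, that comparator is used.

If the user did not define a grouping comparator, the user-defined sort comparator is used.

If the user did not define a sort comparator either, the key's own comparator is used.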

public RawComparator getOutputValueGroupingComparator() {
Class<? extends RawComparator> theClass = getClass(
JobContext.GROUP_COMPARATOR_CLASS, null, RawComparator.class);
if (theClass == null) {
return getOutputKeyComparator();
}

return ReflectionUtils.newInstance(theClass, this);
}
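The fallback chain and its effect on grouping can be sketched with java.util.Comparator. This is a plain-Java illustration, not the Hadoop API: `GroupingFallbackSketch`, `resolveGrouping`, and `countGroups` are hypothetical names, and string keys of the form "custId-orderId" stand in for a composite Writable key.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class GroupingFallbackSketch {
    /** Picks the comparator used for grouping: user grouping, else user sort, else the key's own. */
    static <K> Comparator<K> resolveGrouping(Comparator<K> userGrouping,
                                             Comparator<K> userSort,
                                             Comparator<K> keyNatural) {
        if (userGrouping != null) {
            return userGrouping;
        }
        if (userSort != null) {
            return userSort;
        }
        return keyNatural;
    }

    /** Counts groups in sorted input: a new group starts whenever the comparator reports a difference. */
    static <K> int countGroups(List<K> sortedKeys, Comparator<K> grouping) {
        int groups = 0;
        K previous = null;
        for (K key : sortedKeys) {
            if (groups == 0 || grouping.compare(previous, key) != 0) {
                groups++;
            }
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> sortedKeys = Arrays.asList("1-100", "1-101", "2-200");
        Comparator<String> natural = Comparator.naturalOrder();
        // Grouping comparator that only looks at the customer part of the key.
        Comparator<String> byCustomer = Comparator.comparing(k -> k.substring(0, k.indexOf('-')));

        System.out.println(countGroups(sortedKeys, resolveGrouping(null, null, natural)));       // 3
        System.out.println(countGroups(sortedKeys, resolveGrouping(byCustomer, null, natural))); // 2
    }
}
```

In a real job the two comparators are registered on the driver side with Job.setSortComparatorClass() and Job.setGroupingComparatorClass(); a coarser grouping comparator is the usual way to get several sorted keys delivered to a single reduce() call.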


C. Call

runNewReducer(job, umbilical, reporter, rIter, comparator, keyClass, valueClass);

D. Complete source of the run() method

public void run(JobConf job, final TaskUmbilicalProtocol umbilical)
throws IOException, InterruptedException, ClassNotFoundException {
job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());

if (isMapOrReduce()) {
copyPhase = getProgress().addPhase("copy");
sortPhase  = getProgress().addPhase("sort");
reducePhase = getProgress().addPhase("reduce");
}
// start thread that will handle communication with parent
TaskReporter reporter = startReporter(umbilical);

boolean useNewApi = job.getUseNewReducer();
initialize(job, getJobID(), reporter, useNewApi);

// check if it is a cleanupJobTask
if (jobCleanup) {
runJobCleanupTask(umbilical, reporter);
return;
}
if (jobSetup) {
runJobSetupTask(umbilical, reporter);
return;
}
if (taskCleanup) {
runTaskCleanupTask(umbilical, reporter);
return;
}

// Initialize the codec
codec = initCodec();
RawKeyValueIterator rIter = null;
ShuffleConsumerPlugin shuffleConsumerPlugin = null;

Class combinerClass = conf.getCombinerClass();
CombineOutputCollector combineCollector = 
(null != combinerClass) ? 
new CombineOutputCollector(reduceCombineOutputCounter, reporter, conf) : null;

Class<? extends ShuffleConsumerPlugin> clazz =
  job.getClass(MRConfig.SHUFFLE_CONSUMER_PLUGIN, Shuffle.class, ShuffleConsumerPlugin.class);

shuffleConsumerPlugin = ReflectionUtils.newInstance(clazz, job);
LOG.info("Using ShuffleConsumerPlugin: " + shuffleConsumerPlugin);

ShuffleConsumerPlugin.Context shuffleContext = 
new ShuffleConsumerPlugin.Context(getTaskID(), job, FileSystem.getLocal(job), umbilical, 
          super.lDirAlloc, reporter, codec, 
          combinerClass, combineCollector, 
          spilledRecordsCounter, reduceCombineInputCounter,
          shuffledMapsCounter,
          reduceShuffleBytes, failedShuffleCounter,
          mergedMapOutputsCounter,
          taskStatus, copyPhase, sortPhase, this,
          mapOutputFile, localMapFiles);
shuffleConsumerPlugin.init(shuffleContext);
// Run the shuffle and obtain the iterator over this reducer's input
rIter = shuffleConsumerPlugin.run();
mapOutputFilesOnDisk.clear();
sortPhase.complete();                        
setPhase(TaskStatus.Phase.REDUCE); 
statusUpdate(umbilical);
Class keyClass = job.getMapOutputKeyClass();
Class valueClass = job.getMapOutputValueClass();
// Obtain the grouping comparator
RawComparator comparator = job.getOutputValueGroupingComparator();

if (useNewApi) {
// Call runNewReducer()
runNewReducer(job, umbilical, reporter, rIter, comparator, 
            keyClass, valueClass);
} else {
runOldReducer(job, umbilical, reporter, rIter, comparator, 
            keyClass, valueClass);
}
shuffleConsumerPlugin.close();
done(umbilical, reporter);
}

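Step 2: the runNewReducer() method

A. Wrap the raw iterator so that every call to next() also reports progress

B. Create the task context, taskContext

C. Instantiate the Reducer class through reflection

D. Obtain the RecordWriter that writes the reduce output

E. Create reducerContext, the context of the reduce phase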

org.apache.hadoop.mapreduce.Reducer.Context 
 reducerContext = createReduceContext(reducer, job, getTaskID(),
                                       rIter, reduceInputKeyCounter, 
                                       reduceInputValueCounter, 
                                       trackedRW,
                                       committer,
                                       reporter, comparator, keyClass,
                                       valueClass);

createReduceContext => new ReduceContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, taskId,rIter, inputKeyCounter, inputValueCounter, output,....) 
(Under the hood, createReduceContext creates a ReduceContextImpl object.)

F. Call the run() method of the Reducer class

private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewReducer(JobConf job,
             final TaskUmbilicalProtocol umbilical,
             final TaskReporter reporter,
             RawKeyValueIterator rIter,
             RawComparator<INKEY> comparator,
             Class<INKEY> keyClass,
             Class<INVALUE> valueClass
             ) throws IOException,InterruptedException, 
                      ClassNotFoundException {
// 1. Wrap the raw iterator so that next() also reports progress
final RawKeyValueIterator rawIter = rIter;
rIter = new RawKeyValueIterator() {
public void close() throws IOException {
rawIter.close();
}
public DataInputBuffer getKey() throws IOException {
return rawIter.getKey();
}
public Progress getProgress() {
return rawIter.getProgress();
}
public DataInputBuffer getValue() throws IOException {
return rawIter.getValue();
}
public boolean next() throws IOException {
boolean ret = rawIter.next();
reporter.setProgress(rawIter.getProgress().getProgress());
return ret;
}
};
// Create the task context
org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
  getTaskID(), reporter);
// Instantiate the Reducer class through reflection
org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
(org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>)
ReflectionUtils.newInstance(taskContext.getReducerClass(), job);
// Obtain the RecordWriter
org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> trackedRW = 
new NewTrackingRecordWriter<OUTKEY, OUTVALUE>(this, taskContext);
job.setBoolean("mapred.skip.on", isSkipping());
job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
// Create the reducerContext
org.apache.hadoop.mapreduce.Reducer.Context 
 reducerContext = createReduceContext(reducer, job, getTaskID(),
                                       rIter, reduceInputKeyCounter, 
                                       reduceInputValueCounter, 
                                       trackedRW,
                                       committer,
                                       reporter, comparator, keyClass,
                                       valueClass);
try {
reducer.run(reducerContext);
} finally {
trackedRW.close(reducerContext);
}
}

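*11. MapReduce Optimization

1) Data input:
(1) Merge small files before running the MR job. A large number of small files produces a large number of map tasks, and since launching each task is relatively expensive, the job runs slowly.
(2) Use CombineFileInputFormat as the input format, which logically packs multiple small files into one split.

2) Map phase
	(1) Reduce the number of spills: tune io.sort.mb (the size of the circular buffer; named mapreduce.task.io.sort.mb in Hadoop 2.x and later) and sort.spill.percent (the spill threshold; mapreduce.map.sort.spill.percent) to raise the amount of memory that can fill up before a spill is triggered. Fewer spills mean less disk I/O.
	(2) Reduce the number of merge passes: raise io.sort.factor (mapreduce.task.io.sort.factor) so that more files are merged in each pass, which shortens the MR processing time.
	(3) Run a combiner after map to cut I/O, provided that combining does not change the final result.

3) Reduce phase
(1) Let map and reduce overlap: tune slowstart.completedmaps (mapreduce.job.reduce.slowstart.completedmaps) so that reducers start once a given fraction of the maps has finished, which shortens the time reducers spend waiting.
(2) Avoid the reduce phase where possible, because joining data sets on the reduce side generates a lot of network traffic. Use a map join instead of a reduce join.

4) I/O transfer
(1) Compress data to reduce network I/O time, for example by installing the Snappy and LZOP codecs.
(2) Use SequenceFile binary files.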
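One way to try these settings without recompiling is to pass them as -D generic options, which a driver launched through ToolRunner picks up. The sketch below only assembles such an option string. The property names are the Hadoop 2.x+ names; the values are illustrative assumptions rather than recommendations, and `TuningOptionsSketch` is a hypothetical helper.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class TuningOptionsSketch {
    /** Example tuning properties; every value here is an illustrative assumption. */
    static Map<String, String> tuningProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapreduce.task.io.sort.mb", "200");           // larger circular buffer
        props.put("mapreduce.map.sort.spill.percent", "0.90");   // spill later
        props.put("mapreduce.task.io.sort.factor", "50");        // merge more files per pass
        props.put("mapreduce.job.reduce.slowstart.completedmaps", "0.80"); // when reducers may start
        props.put("mapreduce.map.output.compress", "true");      // compress the shuffled data
        props.put("mapreduce.map.output.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");
        return props;
    }

    /** Renders the properties as "-D key=value" generic options. */
    static String toGenericOptions(Map<String, String> props) {
        return props.entrySet().stream()
                .map(e -> "-D " + e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(toGenericOptions(tuningProperties()));
    }
}
```

The printed string would go between the main class name and the job's own arguments on the hadoop jar command line. Whether a given value helps depends on the job and the cluster, so it is worth comparing the spill and shuffle counters before and after a change.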

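12. MapReduce Join Implementations

1. Reduce Join

Reduce join is the simplest kind of join. The main idea is as follows:

Map phase: the map function reads both files, File1 and File2. The join key becomes the output key, and the output value is the record plus a tag that identifies its source (File1 or File2).

Reduce phase: for each key, the reduce function receives the list of values that came from File1 and File2. It checks whether each value came from File1 or File2, splits the values into two groups, and emits the Cartesian product of the two groups.


Problems:
	1. The map phase does not slim down the data, so both tables are shuffled in full and network transfer and sorting are slow.
	2. Computing the product of the two groups in the reduce phase uses a lot of memory and can easily cause an OOM.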


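import java.io.IOException;
import java.util.Iterator;
import java.util.Vector;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MapRedJoin {
	public static final String DELIMITER = "\u0009"; // field delimiter (tab)

	// map phase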
	public static class MapClass extends MapReduceBase implements
			Mapper<LongWritable, Text, Text, Text> {

		public void configure(JobConf job) {
			super.configure(job);
		}

		public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
				Reporter reporter) throws IOException, ClassCastException {
			// Get the full path and name of the input file
			String filePath = ((FileSplit)reporter.getInputSplit()).getPath().toString();
			// Get the record as a string
			String line = value.toString();
			// Drop empty records
			if (line == null || line.equals("")) return; 

			// Handle records from table A
			if (filePath.contains("m_ys_lab_jointest_a")) {
				String[] values = line.split(DELIMITER); // split the record into fields
				if (values.length < 2) return;

				String id = values[0]; // id
				String name = values[1]; // name

				output.collect(new Text(id), new Text("a#"+name));
			}
			// Handle records from table B
			else if (filePath.contains("m_ys_lab_jointest_b")) {
				String[] values = line.split(DELIMITER); // split the record into fields
				if (values.length < 3) return;

				String id = values[0]; // id
				String statyear = values[1]; // statyear
				String num = values[2]; //num

				output.collect(new Text(id), new Text("b#"+statyear+DELIMITER+num));
			}
		}
	}

	// reduce phase
	public static class Reduce extends MapReduceBase
			implements Reducer<Text, Text, Text, Text> {
		public void reduce(Text key, Iterator<Text> values,
				OutputCollector<Text, Text> output, Reporter reporter)
				throws IOException {

			Vector<String> vecA = new Vector<String>(); // values from table A
			Vector<String> vecB = new Vector<String>(); // values from table B

			while (values.hasNext()) {
				String value = values.next().toString();
				if (value.startsWith("a#")) {
					vecA.add(value.substring(2));
				} else if (value.startsWith("b#")) {
					vecB.add(value.substring(2));
				}
			}

			int sizeA = vecA.size();
			int sizeB = vecB.size();

			// Emit the Cartesian product of the two vectors
			int i, j;
			for (i = 0; i < sizeA; i ++) {
				for (j = 0; j < sizeB; j ++) {
					output.collect(key, new Text(vecA.get(i) + DELIMITER +vecB.get(j)));
				}
			}	
		}
	}
}

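2. Map Join

When one of the two data sets is small, load the whole small data set into memory and index it by the join key.

The large data file is the input of the map job. For each record passed to map(), the join against the small in-memory data set is a simple lookup, and the joined result is emitted keyed by the join key. The data that reaches the reduce phase is therefore already joined.

This approach has an obvious limitation: one of the data sets must be small enough to be loaded into memory on the map side, where the join is performed.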
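Stripped of the Hadoop plumbing, the idea is an in-memory hash join. The following stand-alone sketch is an illustration only: `MapSideJoinSketch` is a hypothetical class, the column layouts are assumptions (customerId/name for the small table, orderId/customerId/amount for the large one, tab-separated), and lists of strings stand in for the cached file and the map input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapSideJoinSketch {
    /** Builds the in-memory index from the small table: customer ID -> customer name. */
    static Map<Integer, String> loadCustomers(List<String> customerLines) {
        Map<Integer, String> byId = new HashMap<>();
        for (String line : customerLines) {
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                continue; // malformed row, skip it
            }
            byId.put(Integer.parseInt(cols[0]), cols[1]);
        }
        return byId;
    }

    /** Streams the large table and joins each order against the index (inner join). */
    static List<String> join(List<String> orderLines, Map<Integer, String> customers) {
        List<String> joined = new ArrayList<>();
        for (String line : orderLines) {
            String[] cols = line.split("\t"); // orderId, customerId, amount
            if (cols.length < 3) {
                continue;
            }
            String name = customers.get(Integer.parseInt(cols[1]));
            if (name == null) {
                continue; // no matching customer, nothing to join
            }
            joined.add(cols[1] + "\t" + cols[0] + "\t" + cols[2] + "\t" + name);
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Integer, String> customers = loadCustomers(Arrays.asList("1\tAlice", "2\tBob"));
        List<String> orders = Arrays.asList("100\t1\t9.5", "101\t3\t4.0");
        for (String row : join(orders, customers)) {
            System.out.println(row);
        }
    }
}
```

Only the first order finds a matching customer, so a single joined row is printed. In the MapReduce version the index is built once per task in setup() and the lookup happens inside map().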


private static class CustomerBean {		
		private int custId;
		private String name;
		private String address;
		private String phone;

		public CustomerBean() {}

		public CustomerBean(int custId, String name, String address,
				String phone) {
			super();
			this.custId = custId;
			this.name = name;
			this.address = address;
			this.phone = phone;
		}



		public int getCustId() {
			return custId;
		}

		public String getName() {
			return name;
		}

		public String getAddress() {
			return address;
		}

		public String getPhone() {
			return phone;
		}
	}

	// Composite map output key: (custId, orderId)
	private static class CustOrderMapOutKey implements WritableComparable<CustOrderMapOutKey> {
		private int custId;
		private int orderId;

		public void set(int custId, int orderId) {
			this.custId = custId;
			this.orderId = orderId;
		}

		public int getCustId() {
			return custId;
		}

		public int getOrderId() {
			return orderId;
		}

		@Override
		public void write(DataOutput out) throws IOException {
			out.writeInt(custId);
			out.writeInt(orderId);
		}

		@Override
		public void readFields(DataInput in) throws IOException {
			custId = in.readInt();
			orderId = in.readInt();
		}

		@Override
		public int compareTo(CustOrderMapOutKey o) {
			int res = Integer.compare(custId, o.custId);
			return res == 0 ? Integer.compare(orderId, o.orderId) : res;
		}

		@Override
		public boolean equals(Object obj) {
			if (obj instanceof CustOrderMapOutKey) {
				CustOrderMapOutKey o = (CustOrderMapOutKey)obj;
				return custId == o.custId && orderId == o.orderId;
			} else {
				return false;
			}
		}

		@Override
		public String toString() {
			return custId + "\t" + orderId;
		}
	}

	private static class JoinMapper extends Mapper<LongWritable, Text, CustOrderMapOutKey, Text> {
		private final CustOrderMapOutKey outputKey = new CustOrderMapOutKey();
		private final Text outputValue = new Text();

		/**
		 * Customer data held in memory (the small table)
		 */
		private static final Map<Integer, CustomerBean> CUSTOMER_MAP = new HashMap<Integer, Join.CustomerBean>();
		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {

			// Format: order ID <TAB> customer ID <TAB> order amount
			String[] cols = value.toString().split("\t");			
			if (cols.length < 3) {
				return;
			}

			int custId = Integer.parseInt(cols[1]);		// extract the customer ID
			// Map-side join: look up the customer in the in-memory table
			CustomerBean customerBean = CUSTOMER_MAP.get(custId);

			if (customerBean == null) {			// no matching customer record to join with
				return;
			}

			StringBuffer sb = new StringBuffer();
			sb.append(cols[2])
				.append("\t")
				.append(customerBean.getName())
				.append("\t")
				.append(customerBean.getAddress())
				.append("\t")
				.append(customerBean.getPhone());

			outputValue.set(sb.toString());
			outputKey.set(custId, Integer.parseInt(cols[0]));

			context.write(outputKey, outputValue);
		}
		// Runs once per task before any map() call: load the small customer table into memory
		@Override
		protected void setup(Context context)
				throws IOException, InterruptedException {
			FileSystem fs = FileSystem.get(URI.create(CUSTOMER_CACHE_URL), context.getConfiguration());

			// Format: customer ID <TAB> name <TAB> address <TAB> phone
			// try-with-resources closes the reader and the underlying stream when loading is done
			try (BufferedReader reader = new BufferedReader(
					new InputStreamReader(fs.open(new Path(CUSTOMER_CACHE_URL))))) {
				String line;
				while ((line = reader.readLine()) != null) {
					String[] cols = line.split("\t");
					if (cols.length < 4) {				// malformed record, skip it
						continue;
					}

					CustomerBean bean = new CustomerBean(Integer.parseInt(cols[0]), cols[1], cols[2], cols[3]);
					CUSTOMER_MAP.put(bean.getCustId(), bean);
				}
			}
		}
	}

	/**
	 * reduce
	 * @author Ivan
	 *
	 */
	private static class JoinReducer extends Reducer<CustOrderMapOutKey, Text, CustOrderMapOutKey, Text> {
		@Override
		protected void reduce(CustOrderMapOutKey key, Iterable<Text> values, Context context)
				throws IOException, InterruptedException {
			// Nothing to do here: the records are already joined, so write them out as they are
			for (Text value : values) {
				context.write(key, value);
			}
		}
	}
	/**
	 * @param args
	 * @throws Exception
	 */
	public static void main(String[] args) throws Exception {
		if (args.length < 2) {
			throw new IllegalArgumentException("Usage: <inpath> <outpath>");
		}

		ToolRunner.run(new Configuration(), new Join(), args);
	}
