hadoop terasort

最新推荐文章于 2022-03-28 13:02:36 发布

学要无止尽

最新推荐文章于 2022-03-28 13:02:36 发布

阅读量2.2k

点赞数 1

Hadoop 专栏收录该内容

29 篇文章 0 订阅

订阅专栏

1. map reduce 和hadoop起源。

MapReduce借用了函数式编程的概念，是Google发明的一种数据处理模型。因为Google几乎爬了互联网上的所有网页，要为处理这些网页并为搜索引擎建立索引是一项非常艰巨的任务，必须借助成千上万台机器同时工作（也就是分布式并行处理），才有可能完成建立索引的任务。
所以，Google发明了MapReduce数据处理模型，而且他们还就此发表了相关论文。

后来，Doug Cutting老大就根据这篇论文硬生生的复制了一个MapReduce出来，也就是今天的Hadoop。

2.简述mapreduce工作流程

MapReduce处理数据过程主要分成2个阶段：map阶段和reduce阶段。先执行map阶段，再执行reduce阶段。

1) 在正式执行map函数前，需要对输入进行“分片”（就是将海量数据分成大概相等的“块”，hadoop的一个分片默认是64M），以便于多个map同时工作，每一个map任务处理一个“分片”。
2) 分片完毕后，多台机器就可以同时进行map工作了。
   map函数要做的事情，相当于对数据进行“预处理”，输出所要的“关切”。
   map对每条记录的输出以<key,value> pair的形式输出。
3) 在进入reduce阶段之前，还要将各个map中相关的数据（key相同的数据）归结到一起，发往一个reducer。这里面就涉及到多个map的输出“混合地”对应多个reducer的情况，这个过程叫做“洗牌”。
4) 接下来进入reduce阶段。相同的key的map输出会到达同一个reducer。
   reducer对key相同的多个value进行“reduce操作”，最后一个key的一串value经过reduce函数的作用后，变成了一个value。

3Hadoop中terasort的算法分析

1、概述

1TB排序通常用于衡量分布式数据处理框架的数据处理能力。Terasort是Hadoop中的的一个排序作业，在2008年，Hadoop在1TB排序基准评估中赢得第一名，耗时209秒。那么Terasort在Hadoop中是怎样实现的呢？本文主要从算法设计角度分析Terasort作业。

2、算法思想

实际上，当我们要把传统的串行排序算法设计成并行的排序算法时，通常会想到分而治之的策略，即：把要排序的数据划成M个数据块（可以用Hash的方法做到），然后每个map task对一个数据块进行局部排序，之后，一个reduce task对所有数据进行全排序。这种设计思路可以保证在map阶段并行度很高，但在reduce阶段完全没有并行。

为了提高reduce阶段的并行度，TeraSort作业对以上算法进行改进：在map阶段，每个map task都会将数据划分成R个数据块（R为reduce task个数），其中第i（i>0）个数据块的所有数据都会比第i+1个中的数据大；在reduce阶段，第i个reduce task处理（进行排序）所有map task的第i块，这样第i个reduce task产生的结果均会比第i+1个大，最后将1~R个reduce task的排序结果顺序输出，即为最终的排序结果。这种设计思路很明显比第一种高效，但实现难度较大，它需要解决以下两个技术难点：第一，如何确定每个map task数据的R个数据块的范围？第二，对于某条数据，如果快速的确定它属于哪个数据块？答案分别为【采样】和【trie树】。

3、Terasort算法

3.1 Terasort算法流程

对于Hadoop的Terasort排序算法，主要由3步组成：采样 –>> map task对于数据记录做标记 –>> reduce task进行局部排序。

数据采样在JobClient端进行，首先从输入数据中抽取一部分数据，将这些数据进行排序，然后将它们划分成R个数据块，找出每个数据块的数据上限和下线（称为“分割点”），并将这些分割点保存到分布式缓存中。

在map阶段，每个map task首先从分布式缓存中读取分割点，并对这些分割点建立trie树（两层trie树，树的叶子节点上保存有该节点对应的reduce task编号）。然后正式开始处理数据，对于每条数据，在trie树中查找它属于的reduce task的编号，并保存起来。

在reduce阶段，每个reduce task从每个map task中读取其对应的数据进行局部排序，最后将reduce task处理后结果按reduce task编号依次输出即可。

3.2 Terasort算法关键点

（1）采样

Hadoop自带了很多数据采样工具，包括IntercalSmapler，RandomSampler，SplitSampler等（具体见org.apache.hadoop.mapred.lib）。

采样数据条数：sampleSize = conf.getLong(“terasort.partitions.sample”, 100000);

选取的split个数：samples = Math.min(10, splits.length); splits是所有split组成的数组。

每个split提取的数据条数：recordsPerSample = sampleSize / samples;

对采样的数据进行全排序，将获取的“分割点”写到文件_partition.lst中，并将它存放到分布式缓存区中。

举例说明：比如采样数据为b，abc，abd，bcd，abcd，efg，hii，afd，rrr，mnk

经排序后，得到：abc，abcd，abd，afd，b，bcd，efg，hii，mnk，rrr

如果reduce task个数为4，则分割点为：abd，bcd，mnk

（2）map task对数据记录做标记

每个map task从文件_partition.lst读取分割点，并创建trie树（假设是2-trie，即组织利用前两个字节）。

Map task从split中一条一条读取数据，并通过trie树查找每条记录所对应的reduce task编号。比如：abg对应第二个reduce task， mnz对应第四个reduce task。

（3）reduce task进行局部排序

每个reduce task进行局部排序，依次输出结果即可。

4.Tera-sort 源码

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
 
package org.apache.hadoop.examples.terasort;
 
import java.io.IOException;
import java.io.PrintStream;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
 
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
 
/**
 * Generates the sampled split points, launches the job, and waits for it to
 * finish.
 * <p>
 * To run the program:
 * <b>bin/hadoop jar hadoop-*-examples.jar terasort in-dir out-dir</b>
 */
public class TeraSort extends Configured implements Tool {
  private static final Log LOG = LogFactory.getLog(TeraSort.class);
 
  /**
   * A partitioner that splits text keys into roughly equal partitions
   * in a global sorted order.
   */
  static class TotalOrderPartitioner implements Partitioner<Text,Text>{
    private TrieNode trie;
    private Text[] splitPoints;
 
    /**
     * A generic trie node
     */
    static abstract class TrieNode {
      private int level;
      TrieNode(int level) {
        this.level = level;
      }
      abstract int findPartition(Text key);
      abstract void print(PrintStream strm) throws IOException;
      int getLevel() {
        return level;
      }
    }
 
    /**
     * An inner trie node that contains 256 children based on the next
     * character.
     */
    static class InnerTrieNode extends TrieNode {
      private TrieNode[] child = new TrieNode[256];
     
      InnerTrieNode(int level) {
        super(level);
      }
      int findPartition(Text key) {
        int level = getLevel();
        if (key.getLength() <= level) {
          return child[0].findPartition(key);
        }
        return child[key.getBytes()[level]].findPartition(key);
      }
      void setChild(int idx, TrieNode child) {
        this.child[idx] = child;
      }
      void print(PrintStream strm) throws IOException {
        for(int ch=0; ch < 255; ++ch) {
          for(int i = 0; i < 2*getLevel(); ++i) {
            strm.print(' ');
          }
          strm.print(ch);
          strm.println(" ->");
          if (child[ch] != null) {
            child[ch].print(strm);
          }
        }
      }
    }
 
    /**
     * A leaf trie node that does string compares to figure out where the given
     * key belongs between lower..upper.
     */
    static class LeafTrieNode extends TrieNode {
      int lower;
      int upper;
      Text[] splitPoints;
      LeafTrieNode(int level, Text[] splitPoints, int lower, int upper) {
        super(level);
        this.splitPoints = splitPoints;
        this.lower = lower;
        this.upper = upper;
      }
      int findPartition(Text key) {
        for(int i=lower; i<upper; ++i) {
          if (splitPoints[i].compareTo(key) >= 0) {
            return i;
          }
        }
        return upper;
      }
      void print(PrintStream strm) throws IOException {
        for(int i = 0; i < 2*getLevel(); ++i) {
          strm.print(' ');
        }
        strm.print(lower);
        strm.print(", ");
        strm.println(upper);
      }
    }
 
 
    /**
     * Read the cut points from the given sequence file.
     * @param fs the file system
     * @param p the path to read
     * @param job the job config
     * @return the strings to split the partitions on
     * @throws IOException
     */
    private static Text[] readPartitions(FileSystem fs, Path p,
                                         JobConf job) throws IOException {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, p, job);
      List<Text> parts = new ArrayList<Text>();
      Text key = new Text();
      NullWritable value = NullWritable.get();
      while (reader.next(key, value)) {
        parts.add(key);
        key = new Text();
      }
      reader.close();
      return parts.toArray(new Text[parts.size()]); 
    }
 
    /**
     * Given a sorted set of cut points, build a trie that will find the correct
     * partition quickly.
     * @param splits the list of cut points
     * @param lower the lower bound of partitions 0..numPartitions-1
     * @param upper the upper bound of partitions 0..numPartitions-1
     * @param prefix the prefix that we have already checked against
     * @param maxDepth the maximum depth we will build a trie for
     * @return the trie node that will divide the splits correctly
     */
    private static TrieNode buildTrie(Text[] splits, int lower, int upper,
                                      Text prefix, int maxDepth) {
      int depth = prefix.getLength();
      if (depth >= maxDepth || lower == upper) {
        return new LeafTrieNode(depth, splits, lower, upper);
      }
      InnerTrieNode result = new InnerTrieNode(depth);
      Text trial = new Text(prefix);
      // append an extra byte on to the prefix
      trial.append(new byte[1], 0, 1);
      int currentBound = lower;
      for(int ch = 0; ch < 255; ++ch) {
        trial.getBytes()[depth] = (byte) (ch + 1);
        lower = currentBound;
        while (currentBound < upper) {
          if (splits[currentBound].compareTo(trial) >= 0) {
            break;
          }
          currentBound += 1;
        }
        trial.getBytes()[depth] = (byte) ch;
        result.child[ch] = buildTrie(splits, lower, currentBound, trial,
                                     maxDepth);
      }
      // pick up the rest
      trial.getBytes()[depth] = 127;
      result.child[255] = buildTrie(splits, currentBound, upper, trial,
                                    maxDepth);
      return result;
    }
 
    public void configure(JobConf job) {
      try {
        FileSystem fs = FileSystem.getLocal(job);
        Path partFile = new Path(TeraInputFormat.PARTITION_FILENAME);
        splitPoints = readPartitions(fs, partFile, job);
        trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
      } catch (IOException ie) {
        throw new IllegalArgumentException("can't read paritions file", ie);
      }
    }
 
    public TotalOrderPartitioner() {
    }
 
    public int getPartition(Text key, Text value, int numPartitions) {
      return trie.findPartition(key);
    }
   
  }
 
  public int run(String[] args) throws Exception {
    LOG.info("starting");
    JobConf job = (JobConf) getConf();
    Path inputDir = new Path(args[0]);
    inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
    Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
    URI partitionUri = new URI(partitionFile.toString() +
                               "#" + TeraInputFormat.PARTITION_FILENAME);
    TeraInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setJobName("TeraSort");
    job.setJarByClass(TeraSort.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormat(TeraInputFormat.class);
    job.setOutputFormat(TeraOutputFormat.class);
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TeraInputFormat.writePartitionFile(job, partitionFile);
    DistributedCache.addCacheFile(partitionUri, job);
    DistributedCache.createSymlink(job);
    job.setInt("dfs.replication", 1);
    TeraOutputFormat.setFinalSync(job, true);
    JobClient.runJob(job);
    LOG.info("done");
    return 0;
  }
 
  /**
   * @param args
   */
  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new JobConf(), new TeraSort(), args);
    System.exit(res);
  }
 
}