1. map reduce 和hadoop起源。
MapReduce借用了函数式编程的概念,是Google发明的一种数据处理模型。因为Google几乎爬了互联网上的所有网页,要为处理这些网页并为搜索引擎建立索引是一项非常艰巨的任务,必须借助成千上万台机器同时工作(也就是分布式并行处理),才有可能完成建立索引的任务。
所以,Google发明了MapReduce数据处理模型,而且他们还就此发表了相关论文。
后来,Doug Cutting老大就根据这篇论文硬生生的复制了一个MapReduce出来,也就是今天的Hadoop。
2.简述mapreduce工作流程
MapReduce处理数据过程主要分成2个阶段:map阶段和reduce阶段。先执行map阶段,再执行reduce阶段。
1) 在正式执行map函数前,需要对输入进行“分片”(就是将海量数据分成大概相等的“块”,hadoop的一个分片默认是64M),以便于多个map同时工作,每一个map任务处理一个“分片”。
2) 分片完毕后,多台机器就可以同时进行map工作了。
map函数要做的事情,相当于对数据进行“预处理”,输出所要的“关切”。
map对每条记录的输出以<key,value> pair的形式输出。
3) 在进入reduce阶段之前,还要将各个map中相关的数据(key相同的数据)归结到一起,发往一个reducer。这里面就涉及到多个map的输出“混合地”对应多个reducer的情况,这个过程叫做“洗牌”。
4) 接下来进入reduce阶段。相同的key的map输出会到达同一个reducer。
reducer对key相同的多个value进行“reduce操作”,最后一个key的一串value经过reduce函数的作用后,变成了一个value。
3Hadoop中terasort的算法分析
1、概述
1TB排序通常用于衡量分布式数据处理框架的数据处理能力。Terasort是Hadoop中的的一个排序作业,在2008年,Hadoop在1TB排序基准评估中赢得第一名,耗时209秒。那么Terasort在Hadoop中是怎样实现的呢?本文主要从算法设计角度分析Terasort作业。
2、算法思想
实际上,当我们要把传统的串行排序算法设计成并行的排序算法时,通常会想到分而治之的策略,即:把要排序的数据划成M个数据块(可以用Hash的方法做到),然后每个map task对一个数据块进行局部排序,之后,一个reduce task对所有数据进行全排序。这种设计思路可以保证在map阶段并行度很高,但在reduce阶段完全没有并行。
为了提高reduce阶段的并行度,TeraSort作业对以上算法进行改进:在map阶段,每个map task都会将数据划分成R个数据块(R为reduce task个数),其中第i(i>0)个数据块的所有数据都会比第i+1个中的数据大;在reduce阶段,第i个reduce task处理(进行排序)所有map task的第i块,这样第i个reduce task产生的结果均会比第i+1个大,最后将1~R个reduce task的排序结果顺序输出,即为最终的排序结果。这种设计思路很明显比第一种高效,但实现难度较大,它需要解决以下两个技术难点:第一,如何确定每个map task数据的R个数据块的范围? 第二,对于某条数据,如果快速的确定它属于哪个数据块?答案分别为【采样】和【trie树】。
3、Terasort算法
3.1 Terasort算法流程
对于Hadoop的Terasort排序算法,主要由3步组成:采样 –>> map task对于数据记录做标记 –>> reduce task进行局部排序。
数据采样在JobClient端进行,首先从输入数据中抽取一部分数据,将这些数据进行排序,然后将它们划分成R个数据块,找出每个数据块的数据上限和下线(称为“分割点”),并将这些分割点保存到分布式缓存中。
在map阶段,每个map task首先从分布式缓存中读取分割点,并对这些分割点建立trie树(两层trie树,树的叶子节点上保存有该节点对应的reduce task编号)。然后正式开始处理数据,对于每条数据,在trie树中查找它属于的reduce task的编号,并保存起来。
在reduce阶段,每个reduce task从每个map task中读取其对应的数据进行局部排序,最后将reduce task处理后结果按reduce task编号依次输出即可。
3.2 Terasort算法关键点
(1)采样
Hadoop自带了很多数据采样工具,包括IntercalSmapler,RandomSampler,SplitSampler等(具体见org.apache.hadoop.mapred.lib)。
采样数据条数:sampleSize = conf.getLong(“terasort.partitions.sample”, 100000);
选取的split个数:samples = Math.min(10, splits.length); splits是所有split组成的数组。
每个split提取的数据条数:recordsPerSample = sampleSize / samples;
对采样的数据进行全排序,将获取的“分割点”写到文件_partition.lst中,并将它存放到分布式缓存区中。
举例说明:比如采样数据为b,abc,abd,bcd,abcd,efg,hii,afd,rrr,mnk
经排序后,得到:abc,abcd,abd,afd,b,bcd,efg,hii,mnk,rrr
如果reduce task个数为4,则分割点为:abd,bcd,mnk
(2)map task对数据记录做标记
每个map task从文件_partition.lst读取分割点,并创建trie树(假设是2-trie,即组织利用前两个字节)。
Map task从split中一条一条读取数据,并通过trie树查找每条记录所对应的reduce task编号。比如:abg对应第二个reduce task, mnz对应第四个reduce task。
(3)reduce task进行局部排序
每个reduce task进行局部排序,依次输出结果即可。
4.Tera-sort 源码
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.examples.terasort;
import java.io.IOException;
import java.io.PrintStream;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/**
* Generates the sampled split points, launches the job, and waits for it to
* finish.
* <p>
* To run the program:
* <b>bin/hadoop jar hadoop-*-examples.jar terasort in-dir out-dir</b>
*/
public class TeraSort extends Configured implements Tool {
private static final Log LOG = LogFactory.getLog(TeraSort.class);
/**
* A partitioner that splits text keys into roughly equal partitions
* in a global sorted order.
*/
static class TotalOrderPartitioner implements Partitioner<Text,Text>{
private TrieNode trie;
private Text[] splitPoints;
/**
* A generic trie node
*/
static abstract class TrieNode {
private int level;
TrieNode(int level) {
this.level = level;
}
abstract int findPartition(Text key);
abstract void print(PrintStream strm) throws IOException;
int getLevel() {
return level;
}
}
/**
* An inner trie node that contains 256 children based on the next
* character.
*/
static class InnerTrieNode extends TrieNode {
private TrieNode[] child = new TrieNode[256];
InnerTrieNode(int level) {
super(level);
}
int findPartition(Text key) {
int level = getLevel();
if (key.getLength() <= level) {
return child[0].findPartition(key);
}
return child[key.getBytes()[level]].findPartition(key);
}
void setChild(int idx, TrieNode child) {
this.child[idx] = child;
}
void print(PrintStream strm) throws IOException {
for(int ch=0; ch < 255; ++ch) {
for(int i = 0; i < 2*getLevel(); ++i) {
strm.print(' ');
}
strm.print(ch);
strm.println(" ->");
if (child[ch] != null) {
child[ch].print(strm);
}
}
}
}
/**
* A leaf trie node that does string compares to figure out where the given
* key belongs between lower..upper.
*/
static class LeafTrieNode extends TrieNode {
int lower;
int upper;
Text[] splitPoints;
LeafTrieNode(int level, Text[] splitPoints, int lower, int upper) {
super(level);
this.splitPoints = splitPoints;
this.lower = lower;
this.upper = upper;
}
int findPartition(Text key) {
for(int i=lower; i<upper; ++i) {
if (splitPoints[i].compareTo(key) >= 0) {
return i;
}
}
return upper;
}
void print(PrintStream strm) throws IOException {
for(int i = 0; i < 2*getLevel(); ++i) {
strm.print(' ');
}
strm.print(lower);
strm.print(", ");
strm.println(upper);
}
}
/**
* Read the cut points from the given sequence file.
* @param fs the file system
* @param p the path to read
* @param job the job config
* @return the strings to split the partitions on
* @throws IOException
*/
private static Text[] readPartitions(FileSystem fs, Path p,
JobConf job) throws IOException {
SequenceFile.Reader reader = new SequenceFile.Reader(fs, p, job);
List<Text> parts = new ArrayList<Text>();
Text key = new Text();
NullWritable value = NullWritable.get();
while (reader.next(key, value)) {
parts.add(key);
key = new Text();
}
reader.close();
return parts.toArray(new Text[parts.size()]);
}
/**
* Given a sorted set of cut points, build a trie that will find the correct
* partition quickly.
* @param splits the list of cut points
* @param lower the lower bound of partitions 0..numPartitions-1
* @param upper the upper bound of partitions 0..numPartitions-1
* @param prefix the prefix that we have already checked against
* @param maxDepth the maximum depth we will build a trie for
* @return the trie node that will divide the splits correctly
*/
private static TrieNode buildTrie(Text[] splits, int lower, int upper,
Text prefix, int maxDepth) {
int depth = prefix.getLength();
if (depth >= maxDepth || lower == upper) {
return new LeafTrieNode(depth, splits, lower, upper);
}
InnerTrieNode result = new InnerTrieNode(depth);
Text trial = new Text(prefix);
// append an extra byte on to the prefix
trial.append(new byte[1], 0, 1);
int currentBound = lower;
for(int ch = 0; ch < 255; ++ch) {
trial.getBytes()[depth] = (byte) (ch + 1);
lower = currentBound;
while (currentBound < upper) {
if (splits[currentBound].compareTo(trial) >= 0) {
break;
}
currentBound += 1;
}
trial.getBytes()[depth] = (byte) ch;
result.child[ch] = buildTrie(splits, lower, currentBound, trial,
maxDepth);
}
// pick up the rest
trial.getBytes()[depth] = 127;
result.child[255] = buildTrie(splits, currentBound, upper, trial,
maxDepth);
return result;
}
public void configure(JobConf job) {
try {
FileSystem fs = FileSystem.getLocal(job);
Path partFile = new Path(TeraInputFormat.PARTITION_FILENAME);
splitPoints = readPartitions(fs, partFile, job);
trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
} catch (IOException ie) {
throw new IllegalArgumentException("can't read paritions file", ie);
}
}
public TotalOrderPartitioner() {
}
public int getPartition(Text key, Text value, int numPartitions) {
return trie.findPartition(key);
}
}
public int run(String[] args) throws Exception {
LOG.info("starting");
JobConf job = (JobConf) getConf();
Path inputDir = new Path(args[0]);
inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
URI partitionUri = new URI(partitionFile.toString() +
"#" + TeraInputFormat.PARTITION_FILENAME);
TeraInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setJobName("TeraSort");
job.setJarByClass(TeraSort.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormat(TeraInputFormat.class);
job.setOutputFormat(TeraOutputFormat.class);
job.setPartitionerClass(TotalOrderPartitioner.class);
TeraInputFormat.writePartitionFile(job, partitionFile);
DistributedCache.addCacheFile(partitionUri, job);
DistributedCache.createSymlink(job);
job.setInt("dfs.replication", 1);
TeraOutputFormat.setFinalSync(job, true);
JobClient.runJob(job);
LOG.info("done");
return 0;
}
/**
* @param args
*/
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new JobConf(), new TeraSort(), args);
System.exit(res);
}
}