MapReduce Tutorial
- Single Node Setup for first-time users.
- Cluster Setup for large, distributed clusters.
- Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell scripts) as the mapper and/or the reducer; a minimal invocation sketch follows this list.
- Hadoop Pipes is a SWIG-compatible C++ API for implementing MapReduce applications.
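As a hedged illustration of the Streaming model only (the location of the streaming jar is an assumption and varies between installations; the input/output paths are placeholders), a job can use ordinary executables as its mapper and reducer:
$ bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input /user/joe/wordcount/input \
-output /user/joe/wordcount/stream-output \
-mapper /bin/cat \
-reducer /usr/bin/wc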
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
export JAVA_HOME=/usr/java/default
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
$ bin/hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class
- /user/joe/wordcount/input - input directory in HDFS
- /user/joe/wordcount/output - output directory in HDFS
$ bin/hadoop fs -ls /user/joe/wordcount/input/
/user/joe/wordcount/input/file01
/user/joe/wordcount/input/file02
$ bin/hadoop fs -cat /user/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop fs -cat /user/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
$ bin/hadoop jar wc.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output
$ bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
job.setCombinerClass(IntSumReducer.class);
< Bye, 1>
< Hello, 1>
< World, 2>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
- Some configuration parameters may have been marked as final by administrators (see Final Parameters) and hence cannot be altered per job.
- Some job parameters are straightforward to set (e.g. Job.setNumReduceTasks(int)), while others interact subtly with the rest of the framework and/or other job parameters, and are consequently more complex to set; a small sketch of the straightforward case follows this list.
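As a minimal sketch of the straightforward case (assuming a Job named job and its Configuration conf, as in the WordCount drivers in this tutorial), such parameters are set directly on the job or its configuration:
// Simple per-job settings.
job.setNumReduceTasks(2); // number of reduce tasks for this job
Configuration conf = job.getConfiguration();
conf.setBoolean("mapreduce.map.speculative", false); // disable speculative map execution
conf.set("wordcount.case.sensitive", "false"); // application-defined keys are passed through to the tasks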
<property>
<name>mapreduce.map.java.opts</name>
<value>
-Xmx512M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@taskid@.gc
-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>
-Xmx1024M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@taskid@.gc
-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
</value>
</property>
Name | Type | Description |
---|---|---|
mapreduce.task.io.sort.mb | int | The cumulative size of the serialization and accounting buffers storing records emitted from the map, in megabytes. |
mapreduce.map.sort.spill.percent | float | The soft limit in the serialization buffer. Once reached, a thread will begin to spill the contents to disk in the background. |
- If either spill threshold is exceeded while a spill is already in progress, collection continues until the spill is finished. For example, if mapreduce.map.sort.spill.percent is set to 0.33 and the remainder of the buffer is filled while the spill runs, the next spill will include all of the collected records, i.e. 0.66 of the buffer, and will not generate additional spills. In other words, the thresholds define triggers, not blocking limits.
- If a record is too large to fit in the serialization buffer, a spill is first triggered, and then the record is spilled to a separate file. It is undefined whether or not this record will first pass through the combiner. A short configuration sketch follows these notes.
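A hedged sketch of tuning the map-side parameters above (the values are purely illustrative, not recommendations):
// Map-side sort/spill tuning, set on the job configuration before submission.
Configuration conf = job.getConfiguration();
conf.setInt("mapreduce.task.io.sort.mb", 256); // serialization/accounting buffer size, in MB
conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // soft limit that triggers a background spill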
Name | Type | Description |
---|---|---|
mapreduce.task.io.sort.factor | int | Specifies the number of segments on disk to be merged at the same time. It limits the number of open files and compression codecs during the merge. If the number of files exceeds this limit, the merge will proceed in several passes. Though this limit also applies to the map, most jobs should be configured so that hitting this limit is unlikely there. |
mapreduce.reduce.merge.inmem.thresholds | int | The number of sorted map outputs fetched into memory before being merged to disk. Like the spill thresholds above, this does not define a unit of partition, but a trigger. In practice, this is usually set very high (1000) or disabled (0), since merging in-memory segments is often less expensive than merging from disk (see the notes following this table). This threshold influences only the frequency of in-memory merges during the shuffle. |
mapreduce.reduce.shuffle.merge.percent | float | The memory threshold for fetched map outputs before an in-memory merge is started, expressed as a percentage of the memory allocated to storing map outputs in memory. Since map outputs that cannot fit in memory can be stalled, setting this high may decrease parallelism between the fetch and the merge. Conversely, values as high as 1.0 have been effective for reduces whose input can fit entirely in memory. This parameter influences only the frequency of in-memory merges during the shuffle. |
mapreduce.reduce.shuffle.input.buffer.percent | float | The percentage of memory, relative to the maximum heap size as typically specified in mapreduce.reduce.java.opts, that can be allocated to storing map outputs during the shuffle. Though some memory should be set aside for the framework, in general it is advantageous to set this high enough to store large and numerous map outputs. |
mapreduce.reduce.input.buffer.percent | float | The percentage of memory, relative to the maximum heap size, that may be retained for map outputs during the reduce. When the reduce begins, map outputs are merged to disk until those that remain are under the resource limit this defines. By default, all map outputs are merged to disk before the reduce begins, to maximize the memory available to the reduce. For less memory-intensive reduces, this should be increased to avoid trips to disk. |
- If a map output is larger than 25 percent of the memory allocated to copying map outputs, it will be written directly to disk without first staging through memory.
- When running with a combiner, the reasoning about high merge thresholds and large buffers may not hold. For merges started before all map outputs have been fetched, the combiner is run while spilling to disk. In some cases, better reduce times can be obtained by spending resources combining map outputs, making disk spills small and parallelizing spilling and fetching, rather than aggressively increasing buffer sizes.
- When merging in-memory map outputs to disk to begin the reduce, if an intermediate merge is necessary because there are segments to spill and at least mapreduce.task.io.sort.factor segments are already on disk, the in-memory map outputs will be part of the intermediate merge. A short configuration sketch follows these notes.
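A companion sketch for the reduce-side parameters above (again, illustrative values only, set on the same job configuration):
// Reduce-side shuffle/merge tuning.
conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f); // heap fraction for fetched map outputs
conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f); // in-memory merge trigger during the shuffle
conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.50f); // heap fraction retained for map outputs during the reduce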
Name | Type | Description |
---|---|---|
mapreduce.job.id | String | The job id |
mapreduce.job.jar | String | job.jar location in job directory |
mapreduce.job.local.dir | String | The job specific shared scratch space |
mapreduce.task.id | String | The task id |
mapreduce.task.attempt.id | String | The task attempt id |
mapreduce.task.is.map | boolean | Is this a map task |
mapreduce.task.partition | int | The id of the task within the job |
mapreduce.map.input.file | String | The filename that the map is reading from |
mapreduce.map.input.start | long | The offset of the start of the map input split |
mapreduce.map.input.length | long | The number of bytes in the map input split |
mapreduce.task.output.dir | String | The task’s temporary output directory |
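As a hedged sketch of how a task can read these values (reusing the setup() hook shown in the Mapper examples in this tutorial), the parameters are available from the task's Configuration:
@Override
public void setup(Context context) {
Configuration conf = context.getConfiguration();
// The framework fills in these keys for every task attempt.
String attemptId = conf.get("mapreduce.task.attempt.id");
String inputFile = conf.get("mapreduce.map.input.file"); // set for map tasks reading file splits
boolean isMap = conf.getBoolean("mapreduce.task.is.map", false);
System.out.println("attempt " + attemptId + " reading " + inputFile + " (map=" + isMap + ")");
}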
- Checking the input and output specifications of the job.
- Computing the InputSplit values for the job.
- Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
- Copying the job's jar and configuration to the MapReduce system directory on the FileSystem.
- Submitting the job to the ResourceManager and optionally monitoring its status.
- Job.submit(): Submit the job to the cluster and return immediately.
- Job.waitForCompletion(boolean): Submit the job to the cluster and wait for it to finish. A small sketch of the non-blocking variant follows this list.
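A minimal sketch of the non-blocking variant, assuming a fully configured Job named job as in the drivers above:
// Submit without blocking, then poll until the job finishes.
job.submit();
while (!job.isComplete()) {
Thread.sleep(5000); // re-check every five seconds
}
System.exit(job.isSuccessful() ? 0 : 1);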
- Validate the input specification of the job.
- Split the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
- Provide the RecordReader implementation (RecordReader itself is an abstract class) used to glean input records from the logical InputSplit for processing by the Mapper.
InputSplit
- Validate the output specification of the job; for example, check that the output directory does not already exist.
- Provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem. A small configuration sketch follows this list.
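As a small sketch (TextInputFormat and TextOutputFormat are the framework defaults; the explicit calls are shown only for illustration):
// Requires org.apache.hadoop.mapreduce.lib.input.TextInputFormat
// and org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);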
- Set up the job during initialization. For example, create the temporary output directory for the job during job initialization. Job setup is done by a separate task when the job is in the PREP state and after initializing tasks. Once the setup task completes, the job moves to the RUNNING state.
- Clean up the job after job completion. For example, remove the temporary output directory after the job completes. Job cleanup is done by a separate task at the end of the job. The job is declared SUCCEEDED/FAILED/KILLED after the cleanup task completes.
- Set up the task's temporary output. Task setup is done as part of the same task, during task initialization.
- Check whether a task needs a commit. This avoids the commit procedure if the task does not need it.
- Commit the task output. Once the task is done, the task commits its output if required.
- Discard the task commit. If the task has failed or been killed, its output is cleaned up. If the task could not clean up (in an exception block), a separate task with the same attempt-id will be launched to do the cleanup.
- "Private" DistributedCache files are cached in a localized directory private to the user whose jobs need these files. These files are shared by all tasks and jobs of that specific user only, and cannot be accessed by jobs of other users on the slaves. A DistributedCache file becomes private by virtue of its permissions on the file system where the file is uploaded, typically HDFS. If the file has no world readable access, or if the directory path leading to the file has no world executable access for lookup, the file becomes private.
- "Public" DistributedCache files are cached in a global directory and the file access is setup such that they are publicly visible to all users. These files can be shared by tasks and jobs of all users on the slaves. A DistributedCache file becomes public by virtue of its permissions on the file system where the files are uploaded, typically HDFS. If the file has world readable access, AND if the directory path leading to the file has world executable access for lookup, then the file becomes public. In other words, if the user intends to make a file publicly available to all users, the file permissions must be set to be world readable, and the directory permissions on the path leading to the file must be world executable. A short sketch of adding a file to the cache follows this list.
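A hedged sketch of distributing a read-only file through the cache (the HDFS path is the one used later in WordCount v2.0; whether the file is cached as private or public depends solely on its permissions as described above):
// Make an HDFS file available to every task of the job via the distributed cache.
// It is cached as "public" only if it is world readable and every directory on
// its path is world executable; otherwise it is cached privately for this user.
job.addCacheFile(new Path("/user/joe/wordcount/patterns.txt").toUri());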
Profiling
Debugging
Example: WordCount v2.0
Source Code
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.StringUtils;
public class WordCount2 {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
static enum CountersEnum { INPUT_WORDS }
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
private boolean caseSensitive;
private Set<String> patternsToSkip = new HashSet<String>();
private Configuration conf;
private BufferedReader fis;
@Override
public void setup(Context context) throws IOException,
InterruptedException {
conf = context.getConfiguration();
caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
if (conf.getBoolean("wordcount.skip.patterns", true)) {
URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();
for (URI patternsURI : patternsURIs) {
Path patternsPath = new Path(patternsURI.getPath());
String patternsFileName = patternsPath.getName().toString();
parseSkipFile(patternsFileName);
}
}
}
private void parseSkipFile(String fileName) {
try {
fis = new BufferedReader(new FileReader(fileName));
String pattern = null;
while ((pattern = fis.readLine()) != null) {
patternsToSkip.add(pattern);
}
} catch (IOException ioe) {
System.err.println("Caught exception while parsing the cached file '"
+ StringUtils.stringifyException(ioe));
}
}
@Override
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
String line = (caseSensitive) ?
value.toString() : value.toString().toLowerCase();
for (String pattern : patternsToSkip) {
line = line.replaceAll(pattern, "");
}
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
Counter counter = context.getCounter(CountersEnum.class.getName(),
CountersEnum.INPUT_WORDS.toString());
counter.increment(1);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
String[] remainingArgs = optionParser.getRemainingArgs();
if ((remainingArgs.length != 2) && (remainingArgs.length != 4)) {
System.err.println("Usage: wordcount <in> <out> [-skip skipPatternFile]");
System.exit(2);
}
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount2.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
List<String> otherArgs = new ArrayList<String>();
for (int i=0; i < remainingArgs.length; ++i) {
if ("-skip".equals(remainingArgs[i])) {
job.addCacheFile(new Path(remainingArgs[++i]).toUri());
job.getConfiguration().setBoolean("wordcount.skip.patterns", true);
} else {
otherArgs.add(remainingArgs[i]);
}
}
FileInputFormat.addInputPath(job, new Path(otherArgs.get(0)));
FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1)));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
$ bin/hadoop fs -ls /user/joe/wordcount/input/
/user/joe/wordcount/input/file01
/user/joe/wordcount/input/file02
$ bin/hadoop fs -cat /user/joe/wordcount/input/file01
Hello World, Bye World!
$ bin/hadoop fs -cat /user/joe/wordcount/input/file02
Hello Hadoop, Goodbye to hadoop.
$ bin/hadoop jar wc.jar WordCount2 /user/joe/wordcount/input /user/joe/wordcount/output
$ bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000
Bye 1
Goodbye 1
Hadoop, 1
Hello 2
World! 1
World, 1
hadoop. 1
to 1
$ bin/hadoop fs -cat /user/joe/wordcount/patterns.txt
\.
\,
\!
to
$ bin/hadoop jar wc.jar WordCount2 -Dwordcount.case.sensitive=true /user/joe/wordcount/input /user/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt
$ bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000
Bye 1
Goodbye 1
Hadoop 1
Hello 2
World 2
hadoop 1
$ bin/hadoop jar wc.jar WordCount2 -Dwordcount.case.sensitive=false /user/joe/wordcount/input /user/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt
$ bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000
bye 1
goodbye 1
hadoop 2
hello 2
world 2
- Demonstrates how applications can access configuration parameters in the setup method of the Mapper (and Reducer) implementations.
- Demonstrates how the DistributedCache can be used to distribute read-only data needed by jobs. Here it allows the user to specify word patterns to skip while counting.
- Demonstrates the utility of the GenericOptionsParser to handle generic Hadoop command-line options.
- Demonstrates how applications can use Counters and how they can set application-specific status information passed to the map (and reduce) methods. A small sketch of reading the counter back in the driver follows this list.
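As a hedged sketch of the driver-side view of that counter (assuming the job has already completed, e.g. after waitForCompletion in main, and that org.apache.hadoop.mapreduce.Counters is imported):
// Read back the custom counter defined in TokenizerMapper.CountersEnum.
Counters counters = job.getCounters();
long inputWords = counters.findCounter(
TokenizerMapper.CountersEnum.class.getName(),
TokenizerMapper.CountersEnum.INPUT_WORDS.toString()).getValue();
System.out.println("Total input words: " + inputWords);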