使用ToolRunner运行Hadoop作业的原理及用法
@(HADOOP)[hadoop, 大数据]
为了简化命令行方式运行作业,方便的调整运行参数,Hadoop自带了一些辅助类。GenericOptionsParser是一个类,用来解释常用的Hadoop命令行选项,并根据需要,为Configuration对象设置相应的取值。通常不直接使用GenericOptionsParser,更方便的方式是:实现Tool接口,通过ToolRunner来运行应用程序,ToolRunner内部调用GenericOptionsParser。
一、示例程序一:打印所有参数
下面是一个简单的程序:
package org.jediael.hadoopdemo.toolrunnerdemo;
import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class ToolRunnerDemo extends Configured implements Tool {
static {
//Configuration.addDefaultResource("hdfs-default.xml");
//Configuration.addDefaultResource("hdfs-site.xml");
//Configuration.addDefaultResource("mapred-default.xml");
//Configuration.addDefaultResource("mapred-site.xml");
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
for (Entry<String, String> entry : conf) {
System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
}
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new ToolRunnerDemo(), args);
System.exit(exitCode);
}
}
以上程序用于输出作业中加载的属性。
1、直接运行程序
[root@jediael project]#hadoop jar toolrunnerdemo.jar org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo
io.seqfile.compress.blocksize=1000000
keep.failed.task.files=false
mapred.disk.healthChecker.interval=60000
dfs.df.interval=60000
dfs.datanode.failed.volumes.tolerated=0
mapreduce.reduce.input.limit=-1
mapred.task.tracker.http.address=0.0.0.0:50060
mapred.used.genericoptionsparser=true
mapred.userlog.retain.hours=24
dfs.max.objects=0
mapred.jobtracker.jobSchedulable=org.apache.hadoop.mapred.JobSchedulable
mapred.local.dir.minspacestart=0
hadoop.native.lib=true
......................
2、通过-D指定新的参数
[root@jediael project]# hadoop org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo -D color=yello | grep color
color=yello
当然更常用的是覆盖默认的参数,如:
-Ddfs.df.interval=30000
3、通过-conf增加新的配置文件
(1)原有参数数量
[root@jediael project]# hadoop jar toolrunnerdemo.jar org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo | wc
67 67 2994
(2)增加配置文件后的参数数量
[root@jediael project]# hadoop jar toolrunnerdemo.jar org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo -conf /opt/jediael/hadoop-1.2.0/conf/mapred-site.xml | wc
68 68 3028
其中mapred-site.xml的内容如下:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
可见此文件只有一个property,因此参数数量从67个变成了68个。
4、在代码中增加参数,如上面程序中注释掉的语句
static {
Configuration.addDefaultResource("hdfs-default.xml");
Configuration.addDefaultResource("hdfs-site.xml");
Configuration.addDefaultResource("mapred-default.xml");
Configuration.addDefaultResource("mapred-site.xml");
}
更多选项请见第Configuration的解释。
二、示例程序二:典型用法(修改wordcount程序)
修改经典的wordcount程序,参考:Hadoop入门经典:WordCount
package org.jediael.hadoopdemo.toolrunnerdemo;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCount extends Configured implements Tool{
public static class WordCountMap extends
Mapper<LongWritable, Text, Text, IntWritable> {
private final IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context)