Running Hadoop Jobs Efficiently with ToolRunner

筱白熊

Published 2021-03-21 15:59:54

Categories: hadoop, mapreduce. Tags: big data, hadoop, mapreduce

Article link: https://blog.csdn.net/Phalaris_arundinacea/article/details/115050659

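Why use `ToolRunner`

Drawbacks of the usual way of running and configuring MapReduce jobs

If the MapReduce Job configuration parameters are written into the Java code, every change means editing the source file, recompiling, repackaging, and redeploying.
If the job depends on a configuration file, you have to write Java code by hand that uses DistributedCache to upload the file to HDFS so that the map and reduce functions can read it.
If the map or reduce functions depend on third-party jar files and you list them with the "-libjars" option on the command line, the option has no effect at all, because a driver that never parses the generic options simply ignores it.

ToolRunner removes these problems: parameters, files, and jars can be specified on the command line when the job is launched from a terminal.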

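Introduction to `GenericOptionsParser`

What it does

GenericOptionsParser automatically copies the parameters given on the command line into the conf variable, so they do not have to be set in code.

Advantages

Parameters no longer need to be hard-coded in Java, so configuration is easily separated from code.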
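
To make the idea concrete, here is a toy model of the parsing step, written in plain Java so that it runs without a Hadoop installation. It is a sketch, not Hadoop's implementation: the class name `MiniOptionsParser` and its fields are hypothetical, a plain `Map` stands in for `Configuration`, and only the `-D` option is handled. It assumes that generic options come before the application arguments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of generic-option parsing (hypothetical class, NOT Hadoop's GenericOptionsParser).
class MiniOptionsParser {
    // Stand-in for Hadoop's Configuration object.
    final Map<String, String> conf = new LinkedHashMap<>();
    // Arguments left over for the application, e.g. input and output paths.
    final List<String> remaining = new ArrayList<>();

    MiniOptionsParser(String[] args) {
        int i = 0;
        // Consume leading -D options; stop at the first other argument.
        while (i < args.length && args[i].startsWith("-D")) {
            String pair;
            if (args[i].length() > 2) {          // form: -Dkey=value
                pair = args[i].substring(2);
                i++;
            } else if (i + 1 < args.length) {    // form: -D key=value
                pair = args[i + 1];
                i += 2;
            } else {
                throw new IllegalArgumentException("-D requires key=value");
            }
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value but got: " + pair);
            }
            conf.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        remaining.addAll(Arrays.asList(args).subList(i, args.length));
    }

    public static void main(String[] args) {
        MiniOptionsParser parser = new MiniOptionsParser(
                new String[] {"-Dmapred.reduce.tasks=5", "/input", "/output"});
        // prints: {mapred.reduce.tasks=5} [/input, /output]
        System.out.println(parser.conf + " " + parser.remaining);
    }
}
```

The point of the sketch is the split: settings end up in the configuration, and the job code only sees its own arguments.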

Example

Parsing command-line arguments with `GenericOptionsParser`

WordCount.java

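import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
    // omitted...
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Generic options (-D, -files, -libjars) are applied to conf;
        // the remaining application arguments are returned.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        // omitted...
        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(conf, "word count");
        // omitted...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}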

Setting the number of reduce tasks from the command line

bin/hadoop jar MyJob.jar com.xxx.MyJobDriver -Dmapred.reduce.tasks=5

Commonly used options: `-libjars` and `-files`

bin/hadoop jar MyJob.jar com.xxx.MyJobDriver -Dmapred.reduce.tasks=5 \
    -files ./dict.conf  \
    -libjars lib/commons-beanutils-1.8.3.jar,lib/commons-digester-2.1.jar

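The -libjars option uploads local jar files to the MapReduce temporary directory in HDFS and adds them to the classpath of the map and reduce tasks. The -files option uploads the specified files to the same temporary directory and lets the map and reduce tasks read them. Both options are implemented through DistributedCache.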
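
As an illustration of the task side, the sketch below shows a small helper that a Mapper's `setup()` method could call to load the shipped file. It rests on assumptions that are not stated in the article: that the file shipped with `-files` is reachable in the task's working directory under its base name (`dict.conf`), and that it holds one entry per line with `#` marking comments. The class name `DictLoader` is hypothetical, and the helper is plain Java so it can be tried without a cluster.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; the file format (one entry per line, '#' comments) is an assumption.
class DictLoader {
    static Set<String> load(Path file) throws IOException {
        Set<String> entries = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String entry = line.trim();
                // Skip blank lines and comment lines.
                if (!entry.isEmpty() && !entry.startsWith("#")) {
                    entries.add(entry);
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // In a task this would be the base name of the file passed to -files (assumed).
        Path file = Paths.get(args.length > 0 ? args[0] : "dict.conf");
        System.out.println(load(file).size() + " entries loaded");
    }
}
```

Loading the file once in `setup()` rather than in every `map()` call keeps the per-record cost low.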

Using `ToolRunner` together with `GenericOptionsParser`

public class WordCount extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns the Configuration populated by ToolRunner.
        Job job = Job.getInstance(getConf(), "word count");
        // omitted...
        // Return the status instead of calling System.exit() inside run().
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(res);
    }
}

With ToolRunner, the call to GenericOptionsParser is hidden inside ToolRunner's own run method and is performed automatically.
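
The control flow can be modelled in a few lines of plain Java. This is a simplified sketch of the pattern, not Hadoop's source: `MiniTool`, `MiniToolRunner`, and `EchoTool` are hypothetical names, a `Map` stands in for `Configuration`, and only `-Dkey=value` options are recognised.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Tool interface.
interface MiniTool {
    void setConf(Map<String, String> conf);
    int run(String[] args) throws Exception;
}

// Toy model of the ToolRunner flow (not Hadoop's implementation).
class MiniToolRunner {
    static int run(Map<String, String> conf, MiniTool tool, String[] args) throws Exception {
        List<String> remaining = new ArrayList<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("-D") && eq > 2) {
                // 1. Generic options are folded into the configuration.
                conf.put(arg.substring(2, eq), arg.substring(eq + 1));
            } else {
                remaining.add(arg);
            }
        }
        // 2. The tool receives the populated configuration.
        tool.setConf(conf);
        // 3. Only application arguments reach run(); its result is the exit code.
        return tool.run(remaining.toArray(new String[0]));
    }
}

// Example tool that reports what it received.
class EchoTool implements MiniTool {
    private Map<String, String> conf;

    @Override
    public void setConf(Map<String, String> conf) {
        this.conf = conf;
    }

    @Override
    public int run(String[] args) {
        System.out.println("reduces=" + conf.get("mapred.reduce.tasks") + ", args=" + args.length);
        // Expect exactly an input and an output path.
        return args.length == 2 ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = MiniToolRunner.run(new HashMap<>(), new EchoTool(),
                new String[] {"-Dmapred.reduce.tasks=5", "/input", "/output"});
        System.exit(exitCode);
    }
}
```

Because the runner owns the parsing, the tool itself contains no option-handling code, which is the separation the article describes.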

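Differences between the `ToolRunner` version and a plain `Driver`

WordCount extends Configured and implements the Tool interface.
It overrides the run method of the Tool interface. run is an instance method rather than a static one, so it can use the configuration stored in the object.
Inside WordCount, the Configuration object is obtained through getConf().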

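Implementation outline

The Driver class extends org.apache.hadoop.conf.Configured.
The Driver class implements the org.apache.hadoop.util.Tool interface and its run method.
The run method configures the Job.
The main() method instantiates the Driver class and passes it to ToolRunner.run(), which calls run().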

Example

Word count

package hadoop.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

public class WordCountDriveTool extends Configured implements Tool {

    static class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        // Reused for every record to avoid allocating new objects per word.
        private final Text word = new Text();
        private final LongWritable one = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Split the line into words (tab-separated).
            String[] values = value.toString().split("\t");
            // Emit (word, 1) for each word.
            for (String token : values) {
                word.set(token);
                context.write(word, one);
            }
        }
    }

    static class WordCountReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            // Sum the occurrences of each word.
            long count = 0L;
            for (LongWritable value : values) {
                count += value.get();
            }
            context.write(key, new LongWritable(count));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // Use the Configuration injected by ToolRunner. Calling
        // setConf(new Configuration()) here would discard command-line options.
        Job job = Job.getInstance(getConf(), this.getClass().getName());
        job.setJarByClass(WordCountDriveTool.class);
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setReducerClass(WordCountReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 0 signals success, non-zero signals failure.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // Default paths for local testing when no arguments are given.
            args = new String[]{"D:\\BigData\\hadoop\\mr\\wordcount\\wcinput", "D:\\BigData\\hadoop\\mr\\wordcount\\wcoutput"};
        }
        // ToolRunner parses the generic options and then calls run().
        int exitCode = ToolRunner.run(new Configuration(), new WordCountDriveTool(), args);
        System.exit(exitCode);
    }
}
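
To sanity-check what the job computes without a cluster, the map and reduce logic can be imitated in plain Java. This is only a local approximation of the listing above, with a hypothetical class name `LocalWordCount`: it splits each line on tabs, as the mapper does, and sums a count per word, as the reducer does. It does not exercise Hadoop itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local imitation of the word-count logic (hypothetical class, no Hadoop involved).
class LocalWordCount {
    static Map<String, Long> count(List<String> lines) {
        // TreeMap keeps the keys sorted, similar to the sorted reducer output.
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // "Map" step: split the line on tabs and emit (word, 1).
            for (String word : line.split("\t")) {
                // "Reduce" step: add up the ones for each word.
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello\tworld", "hello\thadoop");
        // prints: {hadoop=1, hello=2, world=1}
        System.out.println(count(lines));
    }
}
```

Note that the separator is a tab, so a line whose words are separated by spaces is counted as a single word, both here and in the mapper above.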
