Big Data Platform (Part 2): Compiling and Packaging Your Own MapReduce Program from the Command Line

white先生

Category: Big Data

Article link: https://blog.csdn.net/qq_36663613/article/details/80005534


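Preface

My environment is Hadoop 2.7.3 running in cluster mode (the hardware is modest, so the cluster is just two virtual machines). The Java runtime and Hadoop itself are assumed to be installed already. Using WordCount as the example, this post shows how to compile a MapReduce program you wrote yourself on the command line and run it on a Hadoop cluster.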

Configuring the build environment

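Compiling a MapReduce program depends on jar files that ship with Hadoop, because the source imports Hadoop classes,
for example: import org.apache.hadoop.conf.Configuration;
Before going further, it helps to understand the difference between two environment variables, PATH and CLASSPATH. In plain terms:
PATH: the list of directories the shell searches for executables (for example the files under bin or sbin). Without it, you have to run a command from the directory that contains it or give its full path; with it, the command can be run from any directory.
CLASSPATH: the list of directories and jar files in which the Java compiler and the JVM look for classes that are not part of the JDK. Setting it is how javac and java find the external jars a program needs. (The JDK's own core classes need no entry; the JDK locates them by itself.)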
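To make the distinction concrete, here is a minimal sketch that prints both search lists as a running JVM sees them. The class name ShowSearchPaths and the helper splitEntries are made up for this illustration; System.getenv and the java.class.path system property are standard Java.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ShowSearchPaths {
    // Split a PATH-style string on the platform separator
    // (':' on Linux, ';' on Windows), skipping empty entries.
    static List<String> splitEntries(String value) {
        List<String> entries = new ArrayList<>();
        if (value == null) {
            return entries;
        }
        for (String entry : value.split(File.pathSeparator)) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // PATH: where the shell looks for executables such as hadoop or javac.
        System.out.println("PATH entries:");
        for (String dir : splitEntries(System.getenv("PATH"))) {
            System.out.println("  " + dir);
        }
        // java.class.path: where this JVM looks for classes and jars.
        System.out.println("Classpath entries:");
        for (String entry : splitEntries(System.getProperty("java.class.path"))) {
            System.out.println("  " + entry);
        }
    }
}
```

When java is started without a -cp option, the classpath it reports is taken from the CLASSPATH environment variable, which is why exporting that variable is enough for the steps in this post.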
Back to the task at hand: add Hadoop's classpath to the CLASSPATH variable by adding the following lines to ~/.bashrc:
export HADOOP_HOME=/usr/local/hadoop
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath):$CLASSPATH

gedit ~/.bashrc

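Paste the lines above into .bashrc, each starting at the very beginning of a line, and remember to run source ~/.bashrc so that the variables take effect. WordCount.java can then be compiled with javac (this post uses the WordCount.java from the Hadoop source tree; the listing is at the end of the post): javac WordCount.java
This produces several class files, which have to be packed into a jar before they can be run: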
jar -cvf WordCount.jar ./WordCount*.class
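That completes the runnable jar; other MapReduce programs can be packaged the same way. More generally, the same holds for compiling any Java program on the command line: if the program uses external jars, those jars must be on the classpath, otherwise compilation fails.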
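If javac reports that the org.apache.hadoop packages do not exist, the classpath is the first thing to check. The following is a small diagnostic sketch, not part of Hadoop: the class name ClasspathCheck and the method isOnClasspath are hypothetical, and the three class names it probes are simply taken from the imports of WordCount.java.

```java
public class ClasspathCheck {
    // True if the named class can be located on the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: locate the class without running its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Classes imported by WordCount.java; each lives in a jar shipped with Hadoop.
        String[] required = {
            "org.apache.hadoop.conf.Configuration",
            "org.apache.hadoop.mapreduce.Job",
            "org.apache.hadoop.util.GenericOptionsParser"
        };
        for (String name : required) {
            System.out.println((isOnClasspath(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

Compile it with javac ClasspathCheck.java and run it with java -cp ".:$CLASSPATH" ClasspathCheck. Any line marked MISSING suggests that the export in ~/.bashrc has not taken effect in the current shell.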

Running the job on test data

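With the program written and packaged, we need some test data. Create one or more files (for example touch file1), then use gedit or vim to add some content, for example: echo of the rainbow the waiting game. Create an input directory and move the new files into it.
Upload the local files to HDFS and run the job: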

hadoop fs -put ./input /user/xiao
hadoop jar WordCount.jar WordCount /user/xiao/input output

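You may get a class-not-found error. This happens when the source declares a package (the original Hadoop file declares org.apache.hadoop.examples), in which case the class name must be given in full. The correct command is then: hadoop jar WordCount.jar org/apache/hadoop/examples/WordCount input output
[Screenshot: the author's job output]
Two things to note. First, the job may report that the input directory cannot be found, and the fix depends on your setup. Relative HDFS paths resolve against /user/<current user>; since I had not created a hadoop user, the input directory had to be placed under /user/xiao, whereas with a hadoop user configured you can create the /input directory directly. Second, delete the output directory before running the job again (for example hadoop fs -rm -r output), otherwise the job fails because the output directory already exists.
The same approach also works in standalone and pseudo-distributed mode; the only difference in standalone mode is that the files do not need to be put into HDFS.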

Source code

The file is located in hadoop-2.7.3-src\hadoop-mapreduce-project\hadoop-mapreduce-examples\src\main\java\org\apache\hadoop\examples:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
    public WordCount() {
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = (new GenericOptionsParser(conf, args)).getRemainingArgs();
        if(otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        for(int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }

        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true)?0:1);
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public IntSumReducer() {
        }

        public void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
            int sum = 0;

            // Add up every count emitted for this key.
            for (IntWritable val : values) {
                sum += val.get();
            }

            this.result.set(sum);
            context.write(key, this.result);
        }
    }

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public TokenizerMapper() {
        }

        public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());

            while(itr.hasMoreTokens()) {
                this.word.set(itr.nextToken());
                context.write(this.word, one);
            }

        }
    }
}
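
To get a feel for what the job should output before submitting it, the mapper and reducer logic above can be mimicked in plain Java without any Hadoop dependency. This is only a local sketch under that assumption: LocalWordCount and countWords are names invented here, and a TreeMap stands in for the shuffle-and-sort step.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {
    // Count whitespace-separated tokens across all lines, mirroring
    // TokenizerMapper (emit (word, 1)) followed by IntSumReducer (sum per word).
    static Map<String, Integer> countWords(String... lines) {
        // TreeMap keeps keys sorted, similar to the sorted keys a reducer receives.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("echo of the rainbow the waiting game");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Tab-separated key and value, one pair per line.
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

For the sample sentence used earlier, "the" is counted twice and each of the other five words once, which is what the real job's output for a single file with that line should also contain.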
