My First MapReduce Program

kobe_yang24

Article link: https://blog.csdn.net/weixin_39034379/article/details/116202028


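In Hadoop 1.0, MapReduce played two roles: it was both a programming framework for distributed computation and a resource-scheduling framework. In 2.0, the resource-scheduling role was taken over by YARN (Yet Another Resource Negotiator), which will be covered separately later.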

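The core idea is divide and conquer: split the work into parts, process each part, and merge the results at the end. Note that the parts must not affect one another; they run the same logic and differ only in the size of the data set each one handles. Programs are therefore usually structured as a map phase followed by a reduce phase.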
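To make the divide-and-conquer idea concrete, here is a minimal plain-Java sketch with no Hadoop involved. The class name DivideAndConquerSketch and its methods are hypothetical, introduced only for illustration: the input is cut into independent chunks, each chunk is counted on its own, and the partial counts are merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Hadoop or of the original article.
class DivideAndConquerSketch {

    /** Counts the words of one chunk; it needs nothing from any other chunk. */
    static Map<String, Integer> countChunk(List<String> chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : chunk) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    /** Splits the input into chunks of at most chunkSize words, counts each, then merges. */
    static Map<String, Integer> count(List<String> words, int chunkSize) {
        // "Divide": every chunk is processed independently of the others.
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (int start = 0; start < words.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, words.size());
            partials.add(countChunk(words.subList(start, end)));
        }
        // "Merge": add up the partial counts of all chunks.
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(count(words, 2));
    }
}
```

The result does not depend on the chunk size, which is exactly the independence property the paragraph above asks for.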

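1. The first MapReduce program: WordCount

1.1 Maven dependencies (choose your own version; the author used 3.1.4. The ${hadoop.version} placeholder below must be defined under <properties> in your POM.)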

<dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/junit/junit -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>RELEASE</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
    </dependencies>

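1.2 The Mapper

A custom mapper must extend MapReduce's Mapper class. Its four generic type parameters are, in order, the key and value types read from the input file and the key and value types the mapper emits. Because Java's built-in serialization is heavyweight, all of them are Hadoop's own Writable types.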

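import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Every word is emitted with a count of 1; both Writables are reused across calls.
    private final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    /**
     * @param key     byte offset of the line within the input file
     * @param value   one line of input text
     * @param context used to emit (word, 1) pairs
     */
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Words in the input are separated by commas.
        // Exceptions are propagated so that a failed write fails the task
        // instead of being silently printed and ignored.
        for (String token : value.toString().split(",")) {
            word.set(token);
            context.write(word, one);
        }
    }
}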
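To see what the map step produces without a cluster, the following plain-Java sketch mimics it for two input lines. MapPhaseSketch and its methods are hypothetical stand-ins, not Hadoop classes; the only Hadoop behavior assumed is that TextInputFormat hands the mapper the byte offset of each line as the key and the line itself as the value.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the map phase; uses only the Java standard library.
class MapPhaseSketch {

    /** Mirrors WordCountMapper.map: one (word, 1) pair per comma-separated token. */
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.split(",")) {
            out.add(new AbstractMap.SimpleEntry<>(token, 1));
        }
        return out;
    }

    /** Offset of the next line: current offset + bytes in this line + 1 for '\n'. */
    static long nextOffset(long offset, String line) {
        return offset + line.getBytes(StandardCharsets.UTF_8).length + 1;
    }

    public static void main(String[] args) {
        String[] lines = {"hello,world", "hello,hadoop"};
        long offset = 0;
        for (String line : lines) {
            System.out.println(offset + " -> " + map(offset, line));
            offset = nextOffset(offset, line);
        }
        // prints:
        // 0 -> [hello=1, world=1]
        // 12 -> [hello=1, hadoop=1]
    }
}
```

The mapper ignores the offset key entirely; it is shown here only to make clear that the key is a byte position, not a line number.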

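1.3 The Reducer

Like the mapper, the reducer takes four generic type parameters. The first two are the mapper's output key and value types; the last two are the key and value types the reducer writes to the output file.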

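import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * @param key     a word emitted by the mapper
     * @param values  all counts emitted for that word
     * @param context used to emit (word, total) pairs
     * @throws IOException          if writing the output fails
     * @throws InterruptedException if the task is interrupted
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all counts received for this word.
        int result = 0;
        for (IntWritable value : values) {
            result += value.get();
        }
        context.write(key, new IntWritable(result));
    }
}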
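The reducer only ever sees one key together with all of its values, because the framework sorts and groups the mapper output first. As a rough plain-Java sketch of that hand-off (ReducePhaseSketch is a hypothetical name, and a TreeMap stands in for the framework's sort and group steps):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for sort, group and reduce; uses only the Java standard library.
class ReducePhaseSketch {

    /** Collects the 1s emitted for each word; the TreeMap keeps keys in sorted order. */
    static Map<String, List<Integer>> group(List<String> mappedWords) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : mappedWords) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    /** Mirrors WordCountReduce.reduce: sums the values of a single key. */
    static int reduce(String key, Iterable<Integer> values) {
        int result = 0;
        for (int value : values) {
            result += value;
        }
        return result;
    }

    public static void main(String[] args) {
        // Words as they would leave the mapper, each carrying an implicit count of 1.
        List<String> mapped = Arrays.asList("hello", "world", "hello", "hadoop");
        group(mapped).forEach((word, ones) ->
                System.out.println(word + "\t" + reduce(word, ones)));
        // prints (tab-separated):
        // hadoop  1
        // hello   2
        // world   1
    }
}
```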

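1.4 The Job (driver)

The job program is shown below with detailed comments.
A job is generally written by following the eight steps listed in it, so the code can serve as a template.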

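import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// The original post shows only run(). The enclosing class and main() are the usual
// Tool/ToolRunner wrapper implied by super.getConf() and WordCount.class.
public class WordCount extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

        /*
         * The eight steps of a MapReduce job:
         * Step 1: read the file and parse it into key/value pairs (k1, v1)
         * Step 2: custom map logic: take (k1, v1) and emit new pairs (k2, v2)
         * Step 3: partition: records with the same key go to the same reducer
         * Step 4: sort by k2, in lexicographic order
         * Step 5: combine (the combiner): an optional tuning step
         * Step 6: group: values of the same key are merged into one collection
         * Step 7: custom reduce logic: take (k2, v2) and emit new pairs (k3, v3)
         * Step 8: write (k3, v3) to the output
         */

        // Get the Job object.
        Configuration configuration = super.getConf();
        Job firstMpJob = Job.getInstance(configuration, "yang first mr");

        // Required when the program is packaged as a jar and run on a Hadoop cluster.
        firstMpJob.setJarByClass(WordCount.class);

        // Delete the output path if it already exists.
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(configuration);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }

        // 1. Input format: reads the file as (byte offset of line -> line text) pairs.
        firstMpJob.setInputFormatClass(TextInputFormat.class);
        // 1.2 Input path.
        FileInputFormat.addInputPath(firstMpJob, new Path(args[0]));

        // 2. Configure the mapper.
        firstMpJob.setMapperClass(WordCountMapper.class);
        firstMpJob.setMapOutputKeyClass(Text.class);
        firstMpJob.setMapOutputValueClass(IntWritable.class);

        // 3, 4, 5, 6: partition, sort, combine and group are left at their defaults.

        // 7. Configure the reducer.
        firstMpJob.setReducerClass(WordCountReduce.class);
        firstMpJob.setOutputKeyClass(Text.class);
        firstMpJob.setOutputValueClass(IntWritable.class);

        // 8. Configure the output.
        firstMpJob.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(firstMpJob, outputPath);

        return firstMpJob.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new WordCount(), args));
    }
}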
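Steps 3 and 5 are left at their defaults in the job above, so they are easy to overlook. The plain-Java sketch below illustrates what they do, under stated assumptions: PartitionCombineSketch is a hypothetical class, and String.hashCode() stands in for the hash of Hadoop's Text keys, which is computed differently. The partition formula itself follows the one used by Hadoop's default HashPartitioner.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the partition and combine steps; no Hadoop required.
class PartitionCombineSketch {

    /**
     * Step 3, partition: maps a key to a reducer index in [0, numReducers).
     * Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
     * Assumption: String.hashCode() is used here in place of Text.hashCode().
     */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /**
     * Step 5, combine: pre-sums the counts produced by a single map task,
     * so fewer records have to travel to the reducers.
     */
    static Map<String, Integer> combine(List<String> wordsFromOneMapTask) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : wordsFromOneMapTask) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        int numReducers = 2;
        Map<String, Integer> combined = combine(Arrays.asList("hello", "world", "hello"));
        // Each combined record is routed to the reducer chosen by its key alone.
        combined.forEach((word, count) -> System.out.println(
                word + "=" + count + " -> reducer " + partition(word, numReducers)));
    }
}
```

Because summing is associative and the reducer's input and output types match, WordCountReduce itself could plausibly be registered as the combiner in the real job through Job.setCombinerClass; the original article does not do this, so treat it as an optional tuning step.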

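1.5 Package the program and run it on a Hadoop cluster

Packaging is not covered here. After copying the jar to the Hadoop cluster,

run the following command, where the first argument after the main class is the input path (args[0]) and the second is the output path (args[1]):

hadoop jar <your-jar> <main-class> <input-path> <output-path>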

A log like the following means the first MapReduce program ran successfully:

[Screenshot of the job log; image not included]

View the result file:

hdfs dfs -cat yourOutPutFile

[Screenshot of the result file contents; image not included]
