Hadoop Study Notes 4: MapReduce

一直想成为大神的菜鸟

Article link: https://blog.csdn.net/qq_35653822/article/details/113932739

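I. MapReduce Basics

1. A distributed computing framework.

2. Once again, it originates from a paper (Google's MapReduce paper).

3. Strengths: offline (batch) processing of massive data sets; runs on inexpensive commodity machines.

4. Weakness: not suited to real-time processing.

5. Getting started: wordcount

Many scenarios are extensions of wordcount, for example computing a top N.

6. The idea is divide and conquer, and the process resembles merge sort.
The shuffle step in the middle is crucial: the splits are processed on different machines, and shuffle brings all records for the same word to the same place.
After reduce, the final result is written to HDFS.

7. The whole MapReduce process operates on key-value pairs.

8. Core concept: split

A split is the chunk of data handed to one map task. By default, one block (the HDFS storage unit, 128 MB by default) corresponds to one split.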
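To make the block-to-split relationship concrete, here is a rough estimate of how many splits (and therefore map tasks) a file produces. It is only a sketch that assumes every split is exactly one 128 MB block and that the file size is positive; Hadoop's real FileInputFormat applies further rules (for example configured minimum and maximum split sizes), so treat the number as an approximation. The class name and the 300 MB file are made up for illustration.

```java
/** Back-of-the-envelope split count; not Hadoop's actual split algorithm. */
class SplitCountSketch {

    // 128 MB, the default HDFS block size cited in the notes
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    /** Ceiling division: number of block-sized pieces needed to cover the file. */
    static long estimateSplits(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long fileSize = 300L * 1024 * 1024; // hypothetical 300 MB input file
        // 300 MB = 128 MB + 128 MB + 44 MB, so this prints 3
        System.out.println(estimateSplits(fileSize));
    }
}
```

Under these assumptions a 300 MB file would be handled by three map tasks, the last one working on a 44 MB remainder.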

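II. MapReduce Execution Flow

MapReduce maps first and then reduces. The flow has the following steps:
input: read the input data
split: divide the input into splits, one per map task
mapping: tokenize each line by some rule, such as split(" "), and emit every word as a key-value pair with the value 1
shuffle: redistribute the data; because the map output sits on different machines, every occurrence of the same word is sent to the same place
reduce: merge; add up the values of identical words to get the number of times each word occurs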
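The steps above can be imitated in plain Java without any Hadoop classes. This is only a single-process sketch of the data flow, not how Hadoop implements it: each input line stands in for one split, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java imitation of the wordcount data flow; no Hadoop classes are used. */
class WordCountFlowSketch {

    /** mapping: tokenize one line and emit a (word, 1) pair per token. */
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1L));
            }
        }
        return pairs;
    }

    /** shuffle: collect the values of identical keys together, sorted by key. */
    static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** reduce: sum the values collected for one key. */
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs mapping, shuffle and reduce over the given lines (one line per "split"). */
    static Map<String, Long> run(List<String> splits) {
        List<Map.Entry<String, Long>> mapped = new ArrayList<>();
        for (String split : splits) {
            mapped.addAll(map(split));
        }
        Map<String, Long> result = new TreeMap<>();
        shuffle(mapped).forEach((word, ones) -> result.put(word, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        // prints {bear=2, car=3, deer=2, river=2}
        System.out.println(run(List.of("deer bear river", "car car river", "deer car bear")));
    }
}
```

In a real job the three lines could be mapped on three different machines; the grouping done here by shuffle() is what forces all the "car" pairs onto the same reducer.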

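III. Calling MapReduce from Java

1. InputFormat / FileInputFormat (abstract class)
InputFormat has many implementations; the one most commonly used as a base is FileInputFormat. (In the new org.apache.hadoop.mapreduce API, InputFormat is itself an abstract class; it is an interface only in the old mapred API.)
It defines how the input is divided into splits and how a RecordReader reads records from a split.
2. OutputFormat
It supplies the RecordWriter.

Note: use the classes under mapreduce, not the ones under the old mapred package.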

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
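As a rough picture of what a record reader does: with the default TextInputFormat, each record handed to the mapper is a pair of (byte offset of the line, line text), which is why a wordcount mapper takes LongWritable and Text as its input types. The sketch below imitates that in plain Java, assuming UTF-8 text with '\n' line endings; it is not Hadoop's LineRecordReader, and the class name is invented.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Imitates line-oriented record reading: key = byte offset, value = line text. */
class LineRecordSketch {

    /** Returns (byte offset of line start -> line) for '\n'-separated UTF-8 text. */
    static Map<Long, String> readRecords(String splitContent) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : splitContent.split("\n")) {
            records.put(offset, line);
            // advance past the line's bytes plus the '\n' terminator
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // prints {0=hello world, 12=hello hadoop}
        System.out.println(readRecords("hello world\nhello hadoop\n"));
    }
}
```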

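IV. Implementing wordcount with MapReduce in Java

First extend the Mapper class and override the map method.
Then extend the Reducer class and override the reduce method.
Both write their results to the context.
(This code has a bug. I'll leave it as a cliffhanger for now; read on.)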

package com.imooc.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * A WordCount application developed with MapReduce.
 * Upgraded version: meant to delete an already existing output path first.
 */
public class WordCount2App {

    /**
     * Map: reads the input file
     */
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{

        LongWritable one = new LongWritable(1);

        /**
         * Called once per input line.
         * @param key byte offset of the line within the input
         * @param value content of the line
         * @param context used to emit the map output
         * @throws IOException
         * @throws InterruptedException
         */
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            // One line of input
            String line = value.toString();
            // Split the line on the chosen delimiter
            String[] words = line.split(" ");

            for(String word :  words) {
                // Emit the map result through the context
                context.write(new Text(word), one);
            }

        }
    }

    /**
     * Reduce: the merge step
     */
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

            long sum = 0;
            for(LongWritable value : values) {
                // Add up the occurrences of this key
                sum += value.get();
            }

            // Emit the final count
            context.write(key, new LongWritable(sum));
        }
    }

    /**
     * Driver: wraps all the information about the MapReduce job
     */
    public static void main(String[] args) throws Exception{

        // Create the Configuration
        Configuration configuration = new Configuration();

        // Prepare to clean up an existing output directory
        // (the cleanup itself is left commented out here; see the section on the output path)
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(configuration);
        // if(fileSystem.exists(outputPath)){
        //     fileSystem.delete(outputPath, true);
        //     System.out.println("output path existed and has been deleted");
        // }

        // Create the Job
        Job job = Job.getInstance(configuration, "logApp");

        // Set the class used to locate the job's jar
        job.setJarByClass(WordCount2App.class);

        // Set the input path of the job
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // Map settings
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Reduce settings
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Set the output path of the job
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

V. Packaging the Code as a jar and Running It

Command:

hadoop jar <jar-name>.jar <main-class> <arg1> <arg2> ... <argN>

My command:

hadoop jar ~/lib/hadoop-train-1.0-jar-with-dependencies.jar com.imooc.hadoop.project.LogApp /10000_access.log /browserout

In the command above, /10000_access.log and /browserout are both paths on HDFS, and the output path /browserout is one that does not yet exist on HDFS.

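VI. Handling the Output Path

The command above succeeds the first time and fails the second. Why? In MapReduce the output directory must not exist beforehand; if it does, the job fails with a FileAlreadyExistsException.

How to fix it?

There are two ways: 1. in the code, check whether the path exists on HDFS and delete it if it does; 2. put the commands in a shell script and delete the path there.

The first approach is recommended.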

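// Delete the output directory (recursively) if it already exists
Path outputPath = new Path(args[1]);
FileSystem fileSystem = FileSystem.get(configuration);
if(fileSystem.exists(outputPath)){
    fileSystem.delete(outputPath, true);
    System.out.println("output path existed and has been deleted");
}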

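The second approach: write a script with just two commands. The first is hdfs dfs -rm -r <output path> (the output is a directory, so -r is needed); the second runs the hadoop jar command.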

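VII. Combiner

A combiner performs a local merge before reduce.
Difference from shuffle: the combiner runs on the map side, that is, before shuffle, so less data has to be transferred to the reducers.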

  job.setCombinerClass(MyReducer.class);
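Here is a small plain-Java sketch of what the combiner buys you, assuming a single map task whose input is the made-up line "car car car river car". It is an imitation of the idea, not Hadoop code, and the class name is invented. Reusing the reducer as the combiner works for wordcount because summing is commutative and associative; an operation such as averaging could not be reused this way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Shows how local pre-aggregation shrinks the map output before shuffle. */
class CombinerSketch {

    /** Map output of one map task without a combiner: one (word, 1) record per token. */
    static List<Map.Entry<String, Long>> mapOutput(String split) {
        List<Map.Entry<String, Long>> records = new ArrayList<>();
        for (String word : split.split(" ")) {
            if (!word.isEmpty()) {
                records.add(Map.entry(word, 1L));
            }
        }
        return records;
    }

    /** Combiner: sums the records of each word locally, using the same logic as reduce. */
    static Map<String, Long> combine(List<Map.Entry<String, Long>> records) {
        Map<String, Long> combined = new TreeMap<>();
        for (Map.Entry<String, Long> record : records) {
            combined.merge(record.getKey(), record.getValue(), Long::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> raw = mapOutput("car car car river car");
        Map<String, Long> combined = combine(raw);
        System.out.println("records without combiner: " + raw.size());   // 5
        System.out.println("records with combiner: " + combined.size()); // 2
        System.out.println(combined);                                    // {car=4, river=1}
    }
}
```

The reducer still receives the correct totals; it just adds up (car, 4) style partial sums instead of many (car, 1) records.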

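VIII. Partitioner

The partitioner also runs on the map side, before reduce.
It assigns keys to different reduce tasks according to a rule.

1. Produce multiple output files, as the business requires.
2. Several reduce tasks run at once, which improves the overall efficiency of the job.

Extend the Partitioner class and override the getPartition method.
The int it returns is the index of the partition the key goes to (from 0 to numPartitions - 1), not the number of partitions.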

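  // Driver settings. MyPartitioner is an illustrative name for the class below.
  job.setCombinerClass(MyReducer.class);          // optional, independent of the partitioner
  job.setPartitionerClass(MyPartitioner.class);
  job.setNumReduceTasks(4);                       // one reduce task per partition index 0..3

  // Requires: import org.apache.hadoop.mapreduce.Partitioner;
  public static class MyPartitioner extends Partitioner<Text, LongWritable> {

        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {

            if(key.toString().equals("xiaomi")) {
                return 0;
            }

            if(key.toString().equals("huawei")) {
                return 1;
            }

            if(key.toString().equals("iphone7")) {
                return 2;
            }

            // Every other key goes to the last partition
            return 3;
        }
  }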

The partitioner and the combiner are independent of each other; you can use either one without the other.
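For comparison, when no custom partitioner is set, Hadoop falls back to HashPartitioner, which spreads keys by hash code. The sketch below reproduces that arithmetic in plain Java on String keys. Note that Hadoop's Text computes its hash differently from String, so the partition numbers printed here will not match a real job; the class name is invented for the example.

```java
import java.util.List;

/** Hash-based partitioning, using the same formula as Hadoop's default HashPartitioner. */
class HashPartitionSketch {

    /**
     * Masks off the sign bit so a negative hash code cannot yield a negative index,
     * then takes the remainder to land in the range 0..numPartitions-1.
     */
    static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4; // plays the role of the number of reduce tasks
        for (String key : List.of("xiaomi", "huawei", "iphone7", "other")) {
            System.out.println(key + " -> partition " + getPartition(key, numPartitions));
        }
    }
}
```

Hash partitioning balances load without any knowledge of the keys, whereas the hand-written rule above gives control over which output file each brand ends up in.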

IX. JobHistory

Keeps the history of past job runs together with their logs, which makes it easier to track down problems.
