Introduction to MapReduce
MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input dataset into independent chunks, which the map tasks process in a completely parallel manner. The framework sorts the map outputs and feeds them into the reduce tasks. Typically both the input and the output of a job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing any that fail.
Official documentation
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Job submission (diagram)
Note: the ApplicationMaster belongs to the MapReduce framework, not to YARN; it runs as a program inside a NodeManager container.
How does a NodeManager know that a task has been assigned to it?
Through the 3-second heartbeat mechanism: the ResourceManager records its allocation decisions in the scheduling queue (the split plan itself is kept in the job.split file).
MapReduce (diagram)
Input splits
In the source, the split size is computed as: splitSize = max(minSplitSize (default 1), min(maxSplitSize (default Long.MAX_VALUE), blockSize)), so by default the split size equals the block size.
/**
 * Generate the list of files and make them into FileSplits.
 * @param job the job context
 * @throws IOException
 */
public List<InputSplit> getSplits(JobContext job) throws IOException {
  StopWatch sw = new StopWatch().start();
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  long maxSize = getMaxSplitSize(job);

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  for (FileStatus file: files) {
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                      blkLocations[blkIndex].getHosts(),
                      blkLocations[blkIndex].getCachedHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkIndex].getHosts(),
                     blkLocations[blkIndex].getCachedHosts()));
        }
      } else { // not splitable
        splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                    blkLocations[0].getCachedHosts()));
      }
    } else {
      //Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  // Save the number of input files for metrics/loadgen
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
  }
  return splits;
}

protected long computeSplitSize(long blockSize, long minSize,
                                long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}
The number of map tasks depends on the number of input files, the amount of data, and the split size.
A split is a logical concept. The split information includes the start offset, the split size, the blocks that hold the split's data, and the hosts where those blocks live. Each split corresponds to one map task, so adjusting the split size adjusts the number of map tasks, i.e. the parallelism of the map phase.
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)); // returns 1 by default
long maxSize = getMaxSplitSize(job); // returns Long.MAX_VALUE by default
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
return Math.max(minSize, Math.min(maxSize, blockSize)); // computes the split size
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {} // splits carry a 1.1x slack: a small tail is folded into the last split
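For intuition, here is a minimal standalone sketch (plain Java, not the Hadoop API; the 128 MB block size and the 300 MB file size are made-up numbers for illustration) that replays the split loop above:

// Standalone sketch of the split loop above, using assumed sizes for illustration.
public class SplitSizeDemo {
    private static final double SPLIT_SLOP = 1.1; // same slop factor as FileInputFormat

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;           // assumed 128 MB block size
        long minSize = 1L;                             // getFormatMinSplitSize() default
        long maxSize = Long.MAX_VALUE;                 // getMaxSplitSize() default
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize)); // = blockSize

        long length = 300L * 1024 * 1024;              // hypothetical 300 MB input file
        long bytesRemaining = length;
        int splits = 0;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            System.out.println("split of " + splitSize + " bytes");
            bytesRemaining -= splitSize;
            splits++;
        }
        if (bytesRemaining != 0) {                     // tail split, up to 1.1 * splitSize
            System.out.println("final split of " + bytesRemaining + " bytes");
            splits++;
        }
        System.out.println("total splits (map tasks): " + splits);
    }
}

With these defaults, a 300 MB file produces splits of 128 MB, 128 MB and 44 MB (three map tasks), while a 130 MB file stays a single split because 130/128 ≈ 1.02 is below the 1.1 slack.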
Split mechanism
(1) Rules for reading a split (a simplified simulation follows these rules)
① The first split starts reading at the first line (offset 0), reads to the end of the split, and then also reads the first line of the next split.
② A split that is neither the first nor the last discards its first line, reads to the end of the split, and then also reads the first line of the next split.
③ The last split discards its first line and reads to the end of the split.
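A standalone simulation of these rules (plain Java, not the actual LineRecordReader source; the sample text and the 7-byte split size are made up) shows that every line is processed exactly once even when a split boundary cuts a line in half:

import java.util.ArrayList;
import java.util.List;

// Simulates the split-boundary rules above; not Hadoop code.
public class SplitBoundaryDemo {

    // Reads one line starting at pos, returning the index just past its '\n' (or EOF).
    static int readLine(byte[] data, int pos, StringBuilder out) {
        while (pos < data.length && data[pos] != '\n') {
            if (out != null) out.append((char) data[pos]);
            pos++;
        }
        return pos < data.length ? pos + 1 : pos;    // step over the '\n' if present
    }

    public static void main(String[] args) {
        byte[] data = "aaaa\nbbbb\ncccc\ndddd\n".getBytes();
        int splitSize = 7;                           // artificially small split
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            int pos = start;
            if (start != 0) {
                pos = readLine(data, pos, null);     // rules 2 and 3: discard the first line
            }
            List<String> lines = new ArrayList<>();
            while (pos <= end && pos < data.length) { // rules 1 and 2: the last line may run past 'end'
                StringBuilder line = new StringBuilder();
                pos = readLine(data, pos, line);
                lines.add(line.toString());
            }
            System.out.println("split [" + start + "," + end + ") -> " + lines);
        }
    }
}

The output assigns "aaaa" and "bbbb" to the first split, "cccc" to the second and "dddd" to the third: no line is lost or double-counted.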
Setting the split size
FileInputFormat.setMinInputSplitSize(job,1000);
FileInputFormat.setMaxInputSplitSize(job,1000000);
Whether a file can be split at all must be checked: compressed files in particular need this check (isSplitable).
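For reference, this is roughly how TextInputFormat performs that check (paraphrased from the Hadoop source): an uncompressed file is always splittable, while a compressed one is splittable only if its codec supports splitting (e.g. bzip2); otherwise the whole file becomes a single split.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;

// Paraphrase of TextInputFormat.isSplitable
protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
            new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    if (codec == null) {
        return true;                                   // plain text: splittable
    }
    return codec instanceof SplittableCompressionCodec; // only splittable codecs allow splits
}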
How the map side reads data (source walkthrough)
In the src package: InputFormat → FileInputFormat → TextInputFormat; its createRecordReader() returns a LineRecordReader, whose nextKeyValue(), getCurrentKey() and getCurrentValue() methods stream the data line by line.
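A hedged sketch of how the framework drives that reader (simplified; split and taskContext stand for the InputSplit and TaskAttemptContext the framework supplies, and the real driver loop is Mapper.run(), reproduced at the end of this section):

// Simplified consumption loop; split and taskContext are provided by the framework.
RecordReader<LongWritable, Text> reader =
        new TextInputFormat().createRecordReader(split, taskContext);  // a LineRecordReader
reader.initialize(split, taskContext);
while (reader.nextKeyValue()) {                    // stream the split one line at a time
    LongWritable key = reader.getCurrentKey();     // byte offset of the line in the file
    Text value = reader.getCurrentValue();         // the line's content
    // the framework passes each (key, value) on to map(key, value, context)
}
reader.close();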
Shuffle (diagram)
The map task writes into a circular in-memory buffer that keeps a 20% reserve; when the buffer is 80% full its contents are spilled to disk by the spiller.
While this happens, three arrays manage the metadata of the (key, value) records.
Shuffle refers to the whole path from the map side's unordered output to the reduce side's input.
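Both numbers are configurable; a hedged example using the standard Hadoop property names (100 MB and 0.80 are the usual defaults):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 100);             // size of the circular buffer, in MB
conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);  // start spilling when the buffer is 80% full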
Writing an MR program:
- A user-written MapReduce program has three parts: the Mapper, the Reducer, and the Driver (the client that submits and runs the MR job).
- The Mapper's input is key/value pairs (the key and value types can be customized).
- The Mapper's output is key/value pairs (the key and value types can be customized).
- The Mapper's business logic goes in the map method.
- The map method is called once for every key/value pair handed to the map task.
- The Reducer's input is key/value pairs (the key and value types can be customized).
- The Reducer's output is key/value pairs (the key and value types can be customized).
- The Reducer's business logic goes in the reduce method.
- The reduce task calls the reduce method once for every group of key/value pairs sharing the same key.
- User-defined Mappers and Reducers must extend their respective parent classes.
- The whole program needs a Driver to submit it; what gets submitted is a Job object describing all the necessary information.
Example
map
package pvcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * When the framework invokes the map method we wrote, it passes the data in as parameters
 * (one key, one value).
 * Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 * KEYIN:   the type of the key the framework (map task) passes into the map method
 * VALUEIN: the type of the value the framework (map task) passes into the map method
 * By default, the key passed in by the framework is the starting byte offset of a line
 * read from the input (text) file, so KEYIN is a long.
 * The value passed in is the content of that line, so VALUEIN is a String.
 *
 * However, long and String are native Java types and serialize inefficiently,
 * so Hadoop provides replacements:
 * long   -> LongWritable
 * String -> Text
 * int    -> IntWritable
 *
 * After processing, the map method writes out a key/value pair:
 * KEYOUT:   the type of the key in the map method's output
 * VALUEOUT: the type of the value in the map method's output
 */
public class PvConutMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    /**
     * The map method is provided by the MR framework and is called once for every line read.
     * Each line is turned into a key/value pair, handed to map, and map emits a key/value pair.
     * @param key
     * @param value
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        /**
         * Since we process one line at a time, we cannot aggregate in the map phase,
         * but we can emit the data as
         * 216.244.66.231  1
         * 140.205.201.39  1
         * 140.205.201.39  1
         * 220.181.108.104 1
         *
         * Records with the same IP are then shipped to the same machine for the reduce step,
         * grouped together, and their values are summed and written out.
         * This is in fact how MR programs work.
         */
        // value is of type Text
        String line = value.toString();
        String[] fields = line.split(" ");
        String ip = fields[0];
        // after processing, map hands the result back to the framework, which performs the shuffle
        context.write(new Text(ip), new IntWritable(1));
    }
}
reduce
package pvcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 * KEYIN   must match the key type output by the Mapper class
 * VALUEIN must match the value type output by the Mapper class
 */
public class PvCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    /**
     * Once the reduce side has gathered a group of records with the same key,
     * it calls the reduce method once per group.
     * @param key
     * @param values
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // 140.205.201.39 : {1,1,1,1,1,1}
        // Since the group shares one key, we only need to iterate over the values and sum them.
        // Define a counter first.
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
        // The map and reduce phases are now written; next the program must be packaged and
        // handed to YARN for execution. Before that we still need a client program that
        // specifies the input, output, etc. and submits the job to YARN; YARN then
        // distributes the jar to the other machines.
    }
}
driver
package pvcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 * The YARN client program that submits the MR job to YARN for execution.
 * YARN distributes the jar to multiple NodeManagers; execution is ordered:
 * the map program runs first, then the reduce program.
 */
public class PvCountRunner {
    public static void main(String[] args) {
        /**
         * The program needs a number of settings at startup. This information is scattered,
         * so we bundle it into one object: the Job.
         */
        try {
            Configuration conf = new Configuration();
            // get the job and pass it the configuration
            Job job = Job.getInstance(conf, "pvCount");
            // the job object carries the settings
            // the program ships as a jar, so we must point at the jar's location
            // here the jar is placed under /root
            // package the program as pv.jar, upload it to the Linux machine, and run it with:
            // hadoop jar /root/pv.jar pvcount.PvCountRunner /data/pvcount /out/pvcount
            // hadoop jar /root/pv.jar pvcount.CountRunner /access.log /count
            job.setJar("/root/pv.jar");
            /**
             * A jar may contain several jobs, so we must say which Mapper class
             * and which Reducer class this job uses.
             */
            job.setMapperClass(PvConutMapper.class);
            job.setReducerClass(PvCountReduce.class);
            /**
             * The map output has to be serialized, so tell the framework the map output types.
             */
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            /**
             * The reduce output also needs its types specified.
             */
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            /**
             * Tell the framework which component reads the data; for plain text files
             * use TextInputFormat (mind the imports).
             */
            job.setInputFormatClass(TextInputFormat.class);
            /**
             * Tell that component where to read from.
             * TextInputFormat has a parent class, FileInputFormat,
             * and the parent class is used to set the input location.
             * The input path is a directory; if it contains subdirectories,
             * recursive traversal must be enabled, otherwise the job fails.
             */
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileInputFormat.setMinInputSplitSize(job, 1000);
            FileInputFormat.setMaxInputSplitSize(job, 1000000);
            // set the output component
            job.setOutputFormatClass(TextOutputFormat.class);
            // set the output path
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // FileOutputFormat.
            job.setNumReduceTasks(5);
            // run
            /**
             * With everything configured, submit the job to YARN.
             * waitForCompletion submits the jar to the RM,
             * which distributes it so the other machines can execute it.
             */
            // the boolean parameter controls whether progress is reported:
            // if true, the cluster's progress is printed on the client while the job runs
            boolean res = job.waitForCompletion(true);
            /**
             * The exit status code returned by the client can be used in shell scripts
             * to branch on success or failure.
             */
            System.exit(res ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Note: a class extending Mapper can override the setup method, which is called only once per task (useful, for example, to open a database connection),
and the cleanup method, which is called only once, after the map task finishes, and can be used to close resources.
Caution: when closing resources, guard against memory leaks (closing a long-lived object directly while forgetting to close the shorter-lived objects it references).
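A hypothetical example of that lifecycle (the class name, JDBC URL and credentials are placeholders, not part of the original example): setup() opens a connection once per map task, cleanup() closes it once after the last map() call.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper illustrating the setup/cleanup lifecycle described above.
public class DbLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Connection conn;   // long-lived resource shared by all map() calls of this task

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        try {
            // placeholder URL/credentials; in practice read them from the Configuration
            conn = DriverManager.getConnection("jdbc:mysql://db-host:3306/dim", "user", "pass");
        } catch (SQLException e) {
            throw new IOException("failed to open lookup connection", e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use conn to enrich the record, then context.write(...) ...
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        try {
            if (conn != null) {
                conn.close();   // release the resource exactly once, after the last map() call
            }
        } catch (SQLException e) {
            throw new IOException("failed to close lookup connection", e);
        }
    }
}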
The Mapper class (Hadoop source)
package org.apache.hadoop.mapreduce;
import java.io.IOException;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.task.MapContextImpl;
/**
* Maps input key/value pairs to a set of intermediate key/value pairs.
*
* <p>Maps are the individual tasks which transform input records into a
* intermediate records. The transformed intermediate records need not be of
* the same type as the input records. A given input pair may map to zero or
* many output pairs.</p>
*
* <p>The Hadoop Map-Reduce framework spawns one map task for each
* {@link InputSplit} generated by the {@link InputFormat} for the job.
* <code>Mapper</code> implementations can access the {@link Configuration} for
* the job via the {@link JobContext#getConfiguration()}.
*
* <p>The framework first calls
* {@link #setup(org.apache.hadoop.mapreduce.Mapper.Context)}, followed by
* {@link #map(Object, Object, org.apache.hadoop.mapreduce.Mapper.Context)}
* for each key/value pair in the <code>InputSplit</code>. Finally
* {@link #cleanup(org.apache.hadoop.mapreduce.Mapper.Context)} is called.</p>
*
* <p>All intermediate values associated with a given output key are
* subsequently grouped by the framework, and passed to a {@link Reducer} to
* determine the final output. Users can control the sorting and grouping by
* specifying two key {@link RawComparator} classes.</p>
*
* <p>The <code>Mapper</code> outputs are partitioned per
* <code>Reducer</code>. Users can control which keys (and hence records) go to
* which <code>Reducer</code> by implementing a custom {@link Partitioner}.
*
* <p>Users can optionally specify a <code>combiner</code>, via
* {@link Job#setCombinerClass(Class)}, to perform local aggregation of the
* intermediate outputs, which helps to cut down the amount of data transferred
* from the <code>Mapper</code> to the <code>Reducer</code>.
*
* <p>Applications can specify if and how the intermediate
* outputs are to be compressed and which {@link CompressionCodec}s are to be
* used via the <code>Configuration</code>.</p>
*
* <p>If the job has zero
* reduces then the output of the <code>Mapper</code> is directly written
* to the {@link OutputFormat} without sorting by keys.</p>
*
* <p>Example:</p>
* <p><blockquote><pre>
* public class TokenCounterMapper
* extends Mapper<Object, Text, Text, IntWritable>{
*
* private final static IntWritable one = new IntWritable(1);
* private Text word = new Text();
*
* public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
* StringTokenizer itr = new StringTokenizer(value.toString());
* while (itr.hasMoreTokens()) {
* word.set(itr.nextToken());
* context.write(word, one);
* }
* }
* }
* </pre></blockquote>
*
* <p>Applications may override the
* {@link #run(org.apache.hadoop.mapreduce.Mapper.Context)} method to exert
* greater control on map processing e.g. multi-threaded <code>Mapper</code>s
* etc.</p>
*
* @see InputFormat
* @see JobContext
* @see Partitioner
* @see Reducer
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  /**
   * The <code>Context</code> passed on to the {@link Mapper} implementations.
   */
  public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }

  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}
A data-cleansing case implemented with MR in Java mainly requires knowing what data the map side receives (the byte offset and one line of text) and what the reduce side receives (the grouped map output); once the shuffle process is understood it is not difficult.
If a custom JavaBean is passed between map and reduce, it needs to implement Writable (or WritableComparable if it has to be compared), as in the MovieBean below and the comparable sketch after it.
package local_program;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * Created by HP on 2019/4/14 21:57
 */
public class MovieBean implements Writable/*, WritableComparable */ {
    Integer mid;
    Integer score;
    String time;
    Integer uid;

    public MovieBean(Integer mid, Integer score, String time, Integer uid) {
        this.mid = mid;
        this.score = score;
        this.time = time;
        this.uid = uid;
    }

    public MovieBean() {
    }

    public Integer getMid() {
        return mid;
    }

    public void setMid(Integer mid) {
        this.mid = mid;
    }

    public Integer getScore() {
        return score;
    }

    public void setScore(Integer score) {
        this.score = score;
    }

    public String getTime() {
        return time;
    }

    public void setTime(String time) {
        this.time = time;
    }

    public Integer getUid() {
        return uid;
    }

    public void setUid(Integer uid) {
        this.uid = uid;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(this.mid);
        dataOutput.writeInt(this.score);
        dataOutput.writeUTF(this.time);
        dataOutput.writeInt(this.uid);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.mid = dataInput.readInt();
        this.score = dataInput.readInt();
        this.time = dataInput.readUTF();
        this.uid = dataInput.readInt();
    }

    @Override
    public String toString() {
        return "MovieBean{" +
                "mid=" + mid +
                ", score=" + score +
                ", time='" + time + '\'' +
                ", uid=" + uid +
                '}';
    }

    /*@Override
    public int compareTo(Object o) {
        if (o instanceof MovieBean) {
            MovieBean mb = (MovieBean) o;
            return mb.score > this.score ? -1 : 1;
        }
        return 0;
    }*/
}
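If the bean must be comparable (for example when it is used as a map-output key and sorted during shuffle), a minimal sketch of the variant hinted at by the commented-out compareTo above could look like this (ComparableMovieBean is a hypothetical name, sorting descending by score):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Sketch of the comparable variant: WritableComparable lets the framework sort keys of this type.
public class ComparableMovieBean implements WritableComparable<ComparableMovieBean> {
    private int mid;
    private int score;
    private String time;
    private int uid;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(mid);
        out.writeInt(score);
        out.writeUTF(time);
        out.writeInt(uid);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        mid = in.readInt();
        score = in.readInt();
        time = in.readUTF();
        uid = in.readInt();
    }

    @Override
    public int compareTo(ComparableMovieBean other) {
        return Integer.compare(other.score, this.score);  // higher score sorts first
    }
}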
The driver can also implement Tool and be launched through ToolRunner; a sketch follows.
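A hedged sketch of that style, reusing the mapper and reducer from the example above (PvCountTool is a hypothetical name): ToolRunner parses generic options such as -D key=value, -files and -libjars into the Configuration before run() is called.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver rewritten as a Tool.
public class PvCountTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "pvCount");
        job.setJarByClass(PvCountTool.class);           // locate the jar by class instead of a fixed path
        job.setMapperClass(PvConutMapper.class);
        job.setReducerClass(PvCountReduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner injects generic command-line options into the Configuration, then calls run()
        System.exit(ToolRunner.run(new Configuration(), new PvCountTool(), args));
    }
}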