A Beginner's Introduction to MapReduce

MapReduce Overview

MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input dataset into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the map outputs, which are then fed to the reduce tasks. Typically both the input and the output of a job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
Official documentation:
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

Job submission (diagram)

[Figure: job submission flow]
Note: the ApplicationMaster belongs to the MapReduce framework, not to YARN (it is a program running inside a NodeManager).
How does a NodeManager know it has been assigned a task?
Via the 3-second heartbeat mechanism: the ResourceManager records its allocation plan in the scheduling queue, and the split plan is written into the job.split file.

MapReduce (diagram)

[Figure: MapReduce data flow]
Input splits
In the source, the split size is computed as max(minSize, min(maxSize, blockSize)), where minSize defaults to 1 and maxSize defaults to Long.MAX_VALUE, so by default the split size equals the HDFS block size.

/** 
   * Generate the list of files and make them into FileSplits.
   * @param job the job context
   * @throws IOException
   */
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    StopWatch sw = new StopWatch().start();
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file: files) {
      Path path = file.getPath();
      long length = file.getLen();
      if (length != 0) {
        BlockLocation[] blkLocations;
        if (file instanceof LocatedFileStatus) {
          blkLocations = ((LocatedFileStatus) file).getBlockLocations();
        } else {
          FileSystem fs = path.getFileSystem(job.getConfiguration());
          blkLocations = fs.getFileBlockLocations(file, 0, length);
        }
        if (isSplitable(job, path)) {
          long blockSize = file.getBlockSize();
          long splitSize = computeSplitSize(blockSize, minSize, maxSize);

          long bytesRemaining = length;
          while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                        blkLocations[blkIndex].getHosts(),
                        blkLocations[blkIndex].getCachedHosts()));
            bytesRemaining -= splitSize;
          }

          if (bytesRemaining != 0) {
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                       blkLocations[blkIndex].getHosts(),
                       blkLocations[blkIndex].getCachedHosts()));
          }
        } else { // not splitable
          splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                      blkLocations[0].getCachedHosts()));
        }
      } else { 
        //Create empty hosts array for zero length files
        splits.add(makeSplit(path, 0, length, new String[0]));
      }
    }
    // Save the number of input files for metrics/loadgen
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    sw.stop();
    if (LOG.isDebugEnabled()) {
      LOG.debug("Total # of splits generated by getSplits: " + splits.size()
          + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
    }
    return splits;
  }

  protected long computeSplitSize(long blockSize, long minSize,
                                  long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

Number of map tasks: determined by the number of input files, the amount of data, and the split size.

A split is a logical concept. Split information includes the starting offset, the split size, the blocks that hold the split's data, and the list of hosts where those blocks reside. Each split corresponds to one map task, so adjusting the split size adjusts the number of map tasks, i.e. the parallelism of the map phase.

long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)); // returns 1 by default
long maxSize = getMaxSplitSize(job);                                    // returns Long.MAX_VALUE by default
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
return Math.max(minSize, Math.min(maxSize, blockSize));                 // how computeSplitSize determines the split size
// The SPLIT_SLOP check gives splits a 1.1x slack: a trailing chunk is only split off
// if it is larger than 10% of the split size (e.g. a 130 MB file with a 128 MB block
// size yields a single split, because 130/128 < 1.1).
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {}
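To make the arithmetic concrete, here is a small self-contained sketch that mirrors the computation above; the 128 MB block size and the 300 MB file length are made-up example values, not anything taken from a real job configuration:

public class SplitSizeDemo {
    // Same slack factor as FileInputFormat.SPLIT_SLOP
    private static final double SPLIT_SLOP = 1.1;

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;      // assumed HDFS block size: 128 MB
        long minSize = 1L;                        // default minimum split size
        long maxSize = Long.MAX_VALUE;            // default maximum split size
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long length = 300L * 1024 * 1024;         // hypothetical 300 MB input file
        int splits = 0;
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits++;                             // the last, possibly smaller, split
        }
        // Prints splitSize=134217728, numSplits=3 for this example
        System.out.println("splitSize=" + splitSize + ", numSplits=" + splits);
    }
}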

Split mechanism
(1) Rules for reading records across splits
① The first split is read from its first line (offset zero) to the end of the split, and then continues reading the first line of the next split.
② A split that is neither the first nor the last discards its first line, reads to the end of the split, and then continues reading the first line of the next split.
③ The last split discards its first line and reads to the end of the split.
Setting the split size
FileInputFormat.setMinInputSplitSize(job,1000);
FileInputFormat.setMaxInputSplitSize(job,1000000);
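Besides calling the setters on FileInputFormat, the same limits can be supplied as configuration properties; a minimal sketch, assuming the Hadoop 2.x+ property names (the 64 MB / 256 MB values are arbitrary examples):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitSizeConfigDemo {
    public static Job newJob() throws IOException {
        Configuration conf = new Configuration();
        // Equivalent to setMinInputSplitSize / setMaxInputSplitSize on the job
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 64L * 1024 * 1024);   // 64 MB lower bound
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024);  // 256 MB upper bound
        return Job.getInstance(conf, "split-size-demo");
    }
}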

Whether a file can be split must also be checked; compressed files in particular need this check (isSplitable).
Where to read the map-side input source code:
in the src package, InputFormat → FileInputFormat → TextInputFormat → createRecordReader(), which returns a LineRecordReader whose nextKeyValue(), getCurrentKey() and getCurrentValue() read the data as a stream.
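If you want to control splittability yourself, for example to force each file into a single map task, one common pattern is to subclass TextInputFormat and override isSplitable; a minimal sketch (the class name WholeFileTextInputFormat is made up for illustration):

package local_program;  // hypothetical package, mirroring the examples below

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// A minimal sketch: declare every input file non-splittable so that
// one file -> one split -> one map task. TextInputFormat itself already
// returns false for compression codecs that do not support splitting.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split this file
    }
}

It would be plugged in with job.setInputFormatClass(WholeFileTextInputFormat.class) in place of the plain TextInputFormat used in the driver below.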

Shuffle (diagram)

[Figure: shuffle process]
The map output is collected in a circular in-memory buffer. When the buffer fills to 80%, its contents are spilled to disk by the spiller, while the remaining 20% is kept in reserve so the map can keep writing.
During this period the key/value metadata is tracked in three arrays.

Shuffle refers to the whole stage between the unordered map-side output and the reduce-side input: the map output is partitioned, sorted, and delivered to the reducers.
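Which reducer each key lands on during shuffle is decided by the partitioner (HashPartitioner by default). A minimal sketch of a custom Partitioner for the PV-count example shown later; the class name IpPartitioner and the routing rule (IPs starting with "10." go to reducer 0) are made up for illustration:

package pvcount;  // hypothetical addition to the example package below

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// A minimal sketch: route keys to reducers explicitly instead of relying on
// the default HashPartitioner.
public class IpPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;
        }
        if (key.toString().startsWith("10.")) {
            return 0;                             // internal IPs all go to reducer 0
        }
        // hash the rest over the remaining reducers (1 .. numPartitions-1)
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

It would be registered in the driver with job.setPartitionerClass(IpPartitioner.class).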

Writing an MR program:

  1. A user-written MapReduce program has three parts: the Mapper, the Reducer, and the Driver (the client that submits and runs the MR job).
  2. The Mapper's input is key/value pairs (the kv data types can be customized).
  3. The Mapper's output is key/value pairs (the kv data types can be customized).
  4. The Mapper's business logic goes in the map method.
  5. The map method is called once for every key/value pair passed in by the maptask process.
  6. The Reducer's input is key/value pairs (the kv data types can be customized).
  7. The Reducer's output is key/value pairs (the kv data types can be customized).
  8. The Reducer's business logic goes in the reduce method.
  9. The reducetask process calls the reduce method once per group of key/value pairs sharing the same key.
  10. User-defined Mappers and Reducers must extend their respective parent classes.
  11. The whole program needs a Driver for submission; what is submitted is a Job object describing all the necessary information.

Example
map

package pvcount;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
 * When the framework invokes the map method we wrote, it passes the data in as
 * parameters (one key, one value).
 * Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 * KEYIN:   the data type of the key in the input the framework (maptask) passes to the map method
 * VALUEIN: the data type of the value in the input the framework (maptask) passes to the map method
 * By default, the key passed in by the framework is the starting offset of a line read from
 * the file being processed (a text file), so KEYIN is a long.
 * The value passed in by the framework is the content of that line,
 * so VALUEIN is a String.
 *
 * However, long and String are native Java types with poor serialization efficiency,
 * so Hadoop provides replacements for them:
 * long   -> LongWritable
 * String -> Text
 * int    -> IntWritable
 *
 * After processing the data, the map method writes out a key/value pair:
 * KEYOUT:   the data type of the key in the map method's output
 * VALUEOUT: the data type of the value in the map method's output
 *
 */
public class PvConutMapper  extends Mapper<LongWritable, Text,Text, IntWritable> {
    /**
     * The map method is provided by the MR framework and is called once per line of input.
     * In other words, each line is turned into a key/value pair, handed to the map method,
     * and after processing another key/value pair is written out.
     * @param key
     * @param value
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        /**
         * Since we process one line at a time, we cannot aggregate in the map phase,
         * but we can emit the data as
         * 216.244.66.231    1
         * 140.205.201.39    1
         * 140.205.201.39    1
         * 220.181.108.104   1 1 1 1
         *
         * Records with the same IP are then shipped to the same machine for the reduce step,
         * grouped together, and their values are summed and written out.
         * This is in fact exactly how MR programs work.
         */
        //value is of type Text
        String line = value.toString();
        String[] fields = line.split(" ");
        String ip = fields[0];
        //After processing, the map hands the data back to the framework to write out; the framework performs the shuffle
        context.write(new Text(ip), new IntWritable(1));
    }
}

reduce

package pvcount;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
 * Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 *  KEYIN   must match the key type output by the Mapper class
 *  VALUEIN must match the value type output by the Mapper class
 */
public class PvCountReduce extends Reducer<Text, IntWritable,Text,IntWritable> {
    /**
     * After the framework has grouped the data with the same key on the reduce side,
     * it calls the reduce method once per group.
     * @param key
     * @param values
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        //140.205.201.39 : {1,1,1,1,1,1}
        //Since this is one group sharing the same key, we only need to iterate over the values and sum them.
        //First define a counter
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
        //The map and reduce phases are now done; next the program must be packaged and handed to YARN to run.
        //But not yet: we still need a client program that specifies the input, output, etc. and submits the job to YARN.
        //Once YARN has the jar, it distributes it to the other machines.
    }
}

driver

package pvcount;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
/**
 * The YARN client program that submits the MR job to YARN for execution.
 * YARN distributes the jar to multiple NodeManagers.
 * Execution has an order: the map program runs first, then the reduce program.
 */
public class PvCountRunner  {
    public static void main(String[] args) {
        /**
         * The program needs some parameters at startup to tell it various pieces of information.
         * That information is scattered, so we wrap it all into one object: the Job.
         */
        try {
            Configuration conf = new Configuration();
            //Get the job and pass it the parameters
            Job job = Job.getInstance(conf, "pvCount");
            //The job object is used to carry the settings.
            //First, the program is a jar, so we must specify where the jar is.
            //Put the jar in the /root directory:
            //package the program as pv.jar, upload it to the Linux machine,
            //and run it with a command such as:
            //hadoop jar /root/pv.jar pvcount.PvCountRunner /data/pvcount /out/pvcount
            //hadoop jar /root/pv.jar pvcount.PvCountRunner /access.log /count
            job.setJar("/root/pv.jar");
            /**
             * A single jar may contain several jobs, so we must state which Mapper class
             * and which Reducer class this job uses.
             */
            job.setMapperClass(PvConutMapper.class);
            job.setReducerClass(PvCountReduce.class);

            /**
             * The map output has to be serialized, so we tell the framework the map output data types.
             */
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            /**
             * The reduce output also needs its data types specified.
             */
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            /**
             * Tell the framework which component to read the data with; for plain text files
             * that is TextInputFormat (imported from the mapreduce.lib package).
             */
            job.setInputFormatClass(TextInputFormat.class);
            /**
             * Tell that component where to read the data from.
             * TextInputFormat has a parent class, FileInputFormat,
             * and the parent class is used to point at the input location.
             * The input path is a directory; if it contains subdirectories,
             * recursive traversal must be enabled, otherwise the job will fail.
             */
            FileInputFormat.addInputPath(job,new Path(args[0]));
            FileInputFormat.setMinInputSplitSize(job,1000);
            FileInputFormat.setMaxInputSplitSize(job,1000000);
            //Set the output component
            job.setOutputFormatClass(TextOutputFormat.class);
            //Set the output path
            FileOutputFormat.setOutputPath(job,new Path(args[1]));

           // FileOutputFormat.
            job.setNumReduceTasks(5);
            //Run the job
            /**
             * With everything configured, we can call the method that submits the job to YARN.
             * waitForCompletion submits the jar to the RM,
             * which then distributes it so the other machines can run it.
             */
            //It takes a boolean argument and returns true/false when the job finishes.
            //If the argument is true, the cluster reports progress while running and it is printed on the client.
            boolean res = job.waitForCompletion(true);
            /**
             * When the client exits it returns a status code, which can be used in shell scripts
             * to run different logic depending on the returned code.
             */
            System.exit(res? 0:1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

Note: a class that extends Mapper can override the setup method, which is called exactly once (useful, for example, for opening a database connection),
and the cleanup method, which is called exactly once after the maptask finishes and can be used to close resources.
Caution: when closing resources, guard against memory leaks (closing a long-lived object directly while never closing the short-lived objects it references).
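A minimal sketch of that setup/cleanup pattern; the DbLookupMapper class, the Connection field and the JDBC URL are hypothetical placeholders, not part of the example above:

package pvcount;  // hypothetical companion to the example above

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A minimal sketch: open a resource once per maptask in setup() and release it in cleanup().
public class DbLookupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Connection connection;   // hypothetical long-lived resource

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        try {
            // Hypothetical JDBC URL; in practice read it from context.getConfiguration()
            connection = DriverManager.getConnection("jdbc:mysql://dbhost:3306/demo", "user", "pass");
        } catch (SQLException e) {
            throw new IOException("failed to open connection in setup()", e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use the connection to enrich each record, then write the result out
        context.write(new Text(value.toString()), new IntWritable(1));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        try {
            if (connection != null) {
                connection.close();  // release the resource exactly once, after the last map() call
            }
        } catch (SQLException e) {
            throw new IOException("failed to close connection in cleanup()", e);
        }
    }
}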
The Mapper class

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.task.MapContextImpl;

/** 
 * Maps input key/value pairs to a set of intermediate key/value pairs.  
 * 
 * <p>Maps are the individual tasks which transform input records into a 
 * intermediate records. The transformed intermediate records need not be of 
 * the same type as the input records. A given input pair may map to zero or 
 * many output pairs.</p> 
 * 
 * <p>The Hadoop Map-Reduce framework spawns one map task for each 
 * {@link InputSplit} generated by the {@link InputFormat} for the job.
 * <code>Mapper</code> implementations can access the {@link Configuration} for 
 * the job via the {@link JobContext#getConfiguration()}.
 * 
 * <p>The framework first calls 
 * {@link #setup(org.apache.hadoop.mapreduce.Mapper.Context)}, followed by
 * {@link #map(Object, Object, org.apache.hadoop.mapreduce.Mapper.Context)}
 * for each key/value pair in the <code>InputSplit</code>. Finally 
 * {@link #cleanup(org.apache.hadoop.mapreduce.Mapper.Context)} is called.</p>
 * 
 * <p>All intermediate values associated with a given output key are 
 * subsequently grouped by the framework, and passed to a {@link Reducer} to  
 * determine the final output. Users can control the sorting and grouping by 
 * specifying two key {@link RawComparator} classes.</p>
 *
 * <p>The <code>Mapper</code> outputs are partitioned per 
 * <code>Reducer</code>. Users can control which keys (and hence records) go to 
 * which <code>Reducer</code> by implementing a custom {@link Partitioner}.
 * 
 * <p>Users can optionally specify a <code>combiner</code>, via 
 * {@link Job#setCombinerClass(Class)}, to perform local aggregation of the 
 * intermediate outputs, which helps to cut down the amount of data transferred 
 * from the <code>Mapper</code> to the <code>Reducer</code>.
 * 
 * <p>Applications can specify if and how the intermediate
 * outputs are to be compressed and which {@link CompressionCodec}s are to be
 * used via the <code>Configuration</code>.</p>
 *  
 * <p>If the job has zero
 * reduces then the output of the <code>Mapper</code> is directly written
 * to the {@link OutputFormat} without sorting by keys.</p>
 * 
 * <p>Example:</p>
 * <p><blockquote><pre>
 * public class TokenCounterMapper 
 *     extends Mapper&lt;Object, Text, Text, IntWritable&gt;{
 *    
 *   private final static IntWritable one = new IntWritable(1);
 *   private Text word = new Text();
 *   
 *   public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
 *     StringTokenizer itr = new StringTokenizer(value.toString());
 *     while (itr.hasMoreTokens()) {
 *       word.set(itr.nextToken());
 *       context.write(word, one);
 *     }
 *   }
 * }
 * </pre></blockquote>
 *
 * <p>Applications may override the
 * {@link #run(org.apache.hadoop.mapreduce.Mapper.Context)} method to exert
 * greater control on map processing e.g. multi-threaded <code>Mapper</code>s 
 * etc.</p>
 * 
 * @see InputFormat
 * @see JobContext
 * @see Partitioner  
 * @see Reducer
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  /**
   * The <code>Context</code> passed on to the {@link Mapper} implementations.
   */
  public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }
  
  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value, 
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }
  
  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}

Implementing a data-cleansing case with MR in Java mainly comes down to knowing what data the map side sees (the offset and one line) and what the reduce side sees (the map output grouped by key); once the shuffle process is understood, it is not hard.

A custom JavaBean used to carry data through the job must implement Writable (or the WritableComparable interface if it needs to be compared/sorted).

package local_program;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * Created by HP on 2019/4/14 21:57
 */
public class MovieBean implements Writable/*, WritableComparable */{
    Integer mid;
    Integer score;
    String time;
    Integer uid;

    public MovieBean(Integer mid, Integer score, String time, Integer uid) {
        this.mid = mid;
        this.score = score;
        this.time = time;
        this.uid = uid;
    }

    public MovieBean() {
    }

    public Integer getMid() {
        return mid;
    }

    public void setMid(Integer mid) {
        this.mid = mid;
    }

    public Integer getScore() {
        return score;
    }

    public void setScore(Integer score) {
        this.score = score;
    }

    public String getTime() {
        return time;
    }

    public void setTime(String time) {
        this.time = time;
    }

    public Integer getUid() {
        return uid;
    }

    public void setUid(Integer uid) {
        this.uid = uid;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(this.mid);
        dataOutput.writeInt(this.score);
        dataOutput.writeUTF(this.time);
        dataOutput.writeInt(this.uid);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.mid=dataInput.readInt();
        this.score=dataInput.readInt();
        this.time=dataInput.readUTF();
        this.uid=dataInput.readInt();
    }

    @Override
    public String toString() {
        return "MovieBean{" +
                "mid=" + mid +
                ", score=" + score +
                ", time='" + time + '\'' +
                ", uid=" + uid +
                '}';
    }

    /*@Override
    public int compareTo(Object o) {
        if(o instanceof MovieBean){
            MovieBean mb=(MovieBean)o;
            return mb.score>this.score?-1:1;
        }
        return 0;
    }*/
}

You can also implement the Tool interface and launch the driver through ToolRunner; see the sketch below.
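A minimal sketch of the same driver re-expressed with Tool and ToolRunner (only the job wiring is shown; the class name PvCountTool is made up, while the Mapper and Reducer are the ones from the example above):

package pvcount;  // hypothetical rewrite of PvCountRunner using Tool

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// A minimal sketch: implementing Tool lets ToolRunner parse generic options
// (-D key=value, -files, -libjars ...) before run() is called.
public class PvCountTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "pvCount");
        job.setJarByClass(PvCountTool.class);
        job.setMapperClass(PvConutMapper.class);
        job.setReducerClass(PvCountReduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner injects the parsed Configuration into this Tool, then calls run()
        System.exit(ToolRunner.run(new Configuration(), new PvCountTool(), args));
    }
}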
