Basic Use of MapReduce (2)

KujyouRuri

Article link: https://blog.csdn.net/KujyouRuri/article/details/114944880

Partitioning in MapReduce and its code implementation

Partitioning in MapReduce: like goes with like. Records that share the same key are sent to the same reducer.
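When no custom partitioner is set, Hadoop's default HashPartitioner chooses the reducer from the key's hash code. The sketch below reproduces that arithmetic in plain Java, without any Hadoop dependency. The class name HashPartitionSketch and the sample keys are made up for illustration, and it hashes a Java String rather than a Hadoop Text, so the actual partition numbers in a real job may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class HashPartitionSketch {
    // Same arithmetic as HashPartitioner: clear the sign bit, then take the remainder.
    // Assumption: String.hashCode() stands in for the hash of the real key type.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "banana", "apple", "cherry", "banana"};
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            // Equal keys always produce the same partition number
            byPartition.computeIfAbsent(partitionFor(key, 3), p -> new ArrayList<>()).add(key);
        }
        System.out.println(byPartition);
    }
}
```

Because the partition number depends only on the key and the number of reduce tasks, every occurrence of a key ends up in the same partition.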

By default a job runs with a single reduce task. To set the number manually:

    job.setNumReduceTasks(3);

Each reduce task writes one output file.
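With the default file-based output format in the new API, the file written by reduce task number n is named part-r- followed by n padded to five digits. A small plain-Java sketch of that naming, with a hypothetical class name OutputFileNameSketch:

```java
class OutputFileNameSketch {
    // File name written by the reduce task with the given partition number,
    // assuming the default "part" base name.
    static String outputFileName(int partition) {
        return String.format("part-r-%05d", partition);
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        for (int i = 0; i < numReduceTasks; i++) {
            System.out.println(outputFileName(i));
        }
    }
}
```

So a job with three reduce tasks is expected to leave three such files in its output directory, some of which may be empty.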

The implementation is shown below (the POM configuration is the same as in the previous post):

PartitionMain:

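    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class PartitionMain extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // Get the Job object that wraps our task
            Job job = Job.getInstance(super.getConf(), "myPartition");
            // Required when the job is packaged as a jar and run on the cluster
            job.setJarByClass(PartitionMain.class);

            // Step 1: read the input files and parse them into k1, v1
            TextInputFormat.addInputPath(job, new Path(args[0]));
            job.setInputFormatClass(TextInputFormat.class);
            // Step 2: custom mapper logic, takes k1, v1 and emits k2, v2
            job.setMapperClass(PartitionMapper.class);
            // Types of k2 and v2
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(NullWritable.class);

            // Step 3: partition, records with the same key go to the same reducer
            job.setPartitionerClass(cn.itcast.mr.demo1.PartitionOwn.class);
            // Step 4: sort (default)
            // Step 5: combine (not used)
            // Step 6: group (default)

            // Step 7: reduce logic
            job.setReducerClass(PartitionReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);

            // More reduce tasks than partitions: the extra reducers write empty files.
            // Fewer reduce tasks than partitions: a partitioner that returns an index with
            // no matching reducer makes the job fail with an "Illegal partition" error
            // (with a single reduce task, everything simply goes to that one reducer).
            job.setNumReduceTasks(2); // usually matches the number of output files you need

            // Step 8: output
            job.setOutputFormatClass(TextOutputFormat.class);
            TextOutputFormat.setOutputPath(job, new Path(args[1]));
            boolean b = job.waitForCompletion(true);
            return b ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            int run = ToolRunner.run(new Configuration(), new PartitionMain(), args);
            System.exit(run);
        }
    }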

PartitionMapper:

     
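    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PartitionMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Emit k2, v2: k2 is the whole line of text, v2 is NullWritable
            context.write(value, NullWritable.get());
        }
    }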

PartitionReducer:

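    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PartitionReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

        // Write the key through unchanged (identical lines arrive as one group and are written once)
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }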

PartitionOwn:

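    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class PartitionOwn extends Partitioner<Text, NullWritable> {

        /**
         * Decides which reducer a record is sent to.
         * @param text          the k2 value (one line of text)
         * @param nullWritable  the v2 value
         * @param numReduceTask the number of reduce tasks
         * @return the partition number
         */
        @Override
        public int getPartition(Text text, NullWritable nullWritable, int numReduceTask) {
            String line = text.toString();
            String[] split = line.split("\t");
            if (Integer.parseInt(split[5]) > 15) {
                // Values greater than 15 go to partition 0
                return 0;
            } else {
                // Values less than or equal to 15 go to partition 1
                return 1;
            }
        }
    }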
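PartitionOwn throws an exception if a line has fewer than six tab-separated fields or if the sixth field is not a number. The following plain-Java sketch shows the same rule with a defensive fallback. The class name PartitionRuleSketch, the sample lines, and the choice to send malformed lines to partition 1 are all assumptions made for illustration, not part of the original job.

```java
class PartitionRuleSketch {
    // Same rule as PartitionOwn: field index 5 greater than 15 goes to partition 0, otherwise 1.
    // Assumption: malformed lines (too few fields, non-numeric field) fall back to partition 1.
    static int partitionFor(String line) {
        String[] fields = line.split("\t");
        if (fields.length <= 5) {
            return 1;
        }
        try {
            return Integer.parseInt(fields[5].trim()) > 15 ? 0 : 1;
        } catch (NumberFormatException e) {
            return 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("a\tb\tc\td\te\t20\tg")); // sixth field is 20
        System.out.println(partitionFor("a\tb\tc\td\te\t15\tg")); // sixth field is 15
        System.out.println(partitionFor("too\tshort"));           // fewer than six fields
    }
}
```

Keeping the rule in a small static method like this also makes it easy to check against sample lines before running the job on a cluster.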

After building the package, copy the original jar file and the source data file partition.csv to the Linux system.

On the Linux system, start Hadoop and run the following commands:

    cd /export/servers
    hdfs dfs -mkdir /partitionin
    hdfs dfs -put partition.csv /partitionin
    hadoop jar original-day04_mapreduce-1.0-SNAPSHOT.jar PartitionMain /partitionin /partitionout

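Once the job has finished successfully, look up the output directory in the Hadoop web UI.
The partitioning worked: the data has been split into two output files.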

Secondary sort in MapReduce and its code implementation

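Sorting in MapReduce:

Sorting is built in. The framework sorts the map output by key (k2); for Text keys that means dictionary order.

Hadoop does not reuse Java's built-in serialization (java.io.Serializable). It has its own Writable interface, and any class that implements Writable can be serialized.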

Serialization: implement the Writable interface

Sorting: implement the Comparable interface

Both serialization and sorting: implement Writable and Comparable, or simply WritableComparable
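A Writable serializes itself by writing its fields to a DataOutput and reading them back, in the same order, from a DataInput. The sketch below shows that round trip using only java.io, so it runs without Hadoop. The classes WritableRoundTripSketch and Pair are hypothetical stand-ins, not Hadoop types.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableRoundTripSketch {
    // Hypothetical stand-in for a Writable: same write/readFields shape, no Hadoop dependency.
    static class Pair {
        String first;
        int second;

        void write(DataOutput out) throws IOException {
            out.writeUTF(first);
            out.writeInt(second);
        }

        void readFields(DataInput in) throws IOException {
            // Fields must be read in exactly the order they were written
            first = in.readUTF();
            second = in.readInt();
        }
    }

    // Serializes a Pair to bytes and rebuilds a fresh Pair from those bytes.
    static Pair roundTrip(String first, int second) {
        try {
            Pair original = new Pair();
            original.first = first;
            original.second = second;

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.write(out);
            }

            Pair copy = new Pair();
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Pair copy = roundTrip("a", 9);
        System.out.println(copy.first + "\t" + copy.second);
    }
}
```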

If using a whole line of text as k2 cannot produce a secondary sort, wrap the two fields in a JavaBean and use that bean as k2.

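compareTo:

  Integer comparison: returns a positive value (1) if this number is larger, a negative value (-1) if it is smaller, and 0 if the two are equal

  String comparison: returns the difference between the first pair of characters that differ; if one string is a prefix of the other, returns the difference in length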
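A few concrete calls make these return values easier to remember. This is a small standalone sketch; the class name CompareToSketch and its helper methods are made up for illustration.

```java
class CompareToSketch {
    // Thin wrappers so the results can be inspected or tested directly
    static int compareInts(int a, int b) {
        return Integer.valueOf(a).compareTo(Integer.valueOf(b));
    }

    static int compareStrings(String a, String b) {
        return a.compareTo(b);
    }

    public static void main(String[] args) {
        System.out.println(compareInts(5, 9));           // negative: 5 is smaller than 9
        System.out.println(compareInts(9, 9));           // zero: equal
        System.out.println(compareStrings("a", "c"));    // 'a' - 'c'
        System.out.println(compareStrings("abc", "ab")); // length difference, "ab" is a prefix
    }
}
```

Sorting code should only rely on the sign of the result, not on its exact magnitude.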

What Integer.valueOf() does
Integer.valueOf() converts a primitive int into its wrapper type Integer, or parses a String into an Integer. If the String is null or "", it throws a NumberFormatException.
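Since a null or empty String makes Integer.valueOf() throw, input parsed from text files is often wrapped in a guard. A minimal sketch follows; the helper parseOrDefault and the class name ValueOfSketch are hypothetical.

```java
class ValueOfSketch {
    // Returns the parsed value, or the fallback when the text is null, empty, or not a number.
    static Integer parseOrDefault(String text, Integer fallback) {
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(Integer.valueOf(7));         // int to Integer
        System.out.println(parseOrDefault("12", -1));   // String to Integer
        System.out.println(parseOrDefault(null, -1));   // null falls back
        System.out.println(parseOrDefault("", -1));     // empty string falls back
    }
}
```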

Java source code:

SortMain:

     
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class SortMain extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // Get the Job object
            Job job = Job.getInstance(super.getConf(), "sort");

            // Step 1: read the input files and parse them into k1, v1
            job.setInputFormatClass(TextInputFormat.class);
            TextInputFormat.addInputPath(job, new Path("file:///D:\\大数据练习\\input"));

            // Step 2: custom map logic, takes k1, v1 and emits k2, v2
            job.setMapperClass(SortMapper.class);
            job.setMapOutputKeyClass(K2Bean.class);
            job.setMapOutputValueClass(NullWritable.class);
            /**
             * Steps 3 to 6: partition, sort, combine, group.
             * Defaults are used except for the combiner set below.
             */

            // Set the combiner class
            job.setCombinerClass(SortReducer.class);

            // Step 7: custom reduce logic
            job.setReducerClass(SortReducer.class);
            job.setOutputKeyClass(K2Bean.class);
            job.setOutputValueClass(NullWritable.class);

            // Step 8: output. The output directory must not exist yet, so it has to
            // differ from the input directory (the "output" name here is an assumption).
            job.setOutputFormatClass(TextOutputFormat.class);
            TextOutputFormat.setOutputPath(job, new Path("file:///D:\\大数据练习\\output"));

            // Submit the job and wait for it to finish
            boolean b = job.waitForCompletion(true);

            return b ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            int run = ToolRunner.run(new Configuration(), new SortMain(), args);
            System.exit(run);
        }
    }

SortMapper:

   
        
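    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SortMapper extends Mapper<LongWritable, Text, K2Bean, NullWritable> {

        /**
         * Reads one line and wraps its two fields in a K2Bean, which becomes k2.
         * @param key     byte offset of the line (k1)
         * @param value   the line of text (v1)
         * @param context the task context
         * @throws IOException
         * @throws InterruptedException
         */
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Count how many records the map phase has read
            Counter counter = context.getCounter("MR_INPUT_COUNT", "MAP_TOTAL_RESULT");
            counter.increment(1L);

            String[] split = value.toString().split("\t");
            K2Bean k2Bean = new K2Bean();
            k2Bean.setFirst(split[0]);
            k2Bean.setSecond(Integer.parseInt(split[1]));
            context.write(k2Bean, NullWritable.get());
        }
    }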

K2Bean:

     
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class K2Bean implements WritableComparable<K2Bean> {
        /**
         Contents of the source text file:
         a  1
         a  5
         a  7
         a  9
         a  9
         b  3
         b  8
         b  10
         */

        private String first;
        private int second;

        /*
         compareTo decides the sort order of the keys
         */
        @Override
        public int compareTo(K2Bean o) {
            // Compare the first field; only if it is equal, compare the second field.
            // If the first fields differ, that result already decides the order.
            int i = this.first.compareTo(o.first);
            if (i == 0) {
                // First fields are equal, so compare the second field
                int i1 = Integer.valueOf(this.second).compareTo(Integer.valueOf(o.second));
                return i1;     // return i1 for ascending order, -i1 for descending order
            } else {
                // Return the result of comparing the first field
                return i;      // return i for ascending order, -i for descending order
            }
        }

        // Serialization
        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(first);
            out.writeInt(second);
        }

        /**
         * Deserialization: fields are read in the same order they were written.
         * @param in
         * @throws IOException
         */
        @Override
        public void readFields(DataInput in) throws IOException {
            this.first = in.readUTF();
            this.second = in.readInt();
        }

        // Getters and setters
        public String getFirst() {
            return first;
        }

        public void setFirst(String first) {
            this.first = first;
        }

        public int getSecond() {
            return second;
        }

        public void setSecond(int second) {
            this.second = second;
        }

        // Overridden so that the output file contains first and second as text
        @Override
        public String toString() {
            return first + "\t" + second;
        }
    }
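The two-level comparison in K2Bean can be tried outside Hadoop. The sketch below applies the same ordering to a few lines shaped like the sample data, using only the Java standard library. The classes SecondarySortSketch and Key are hypothetical stand-ins for the real job and for K2Bean.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SecondarySortSketch {
    // Hypothetical plain-Java stand-in for K2Bean with the same two-level comparison.
    static class Key implements Comparable<Key> {
        final String first;
        final int second;

        Key(String first, int second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public int compareTo(Key o) {
            int byFirst = first.compareTo(o.first);
            // Only fall through to the second field when the first fields are equal
            return byFirst != 0 ? byFirst : Integer.compare(second, o.second);
        }

        @Override
        public String toString() {
            return first + "\t" + second;
        }
    }

    // Parses tab-separated lines, sorts them by (first, second), and formats them again.
    static List<String> sortLines(List<String> lines) {
        List<Key> keys = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            keys.add(new Key(fields[0], Integer.parseInt(fields[1])));
        }
        Collections.sort(keys);
        List<String> sorted = new ArrayList<>();
        for (Key key : keys) {
            sorted.add(key.toString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("b\t8", "a\t9", "a\t1", "b\t3", "a\t9");
        for (String line : sortLines(input)) {
            System.out.println(line);
        }
    }
}
```

Negating the second-field comparison would keep the first field ascending while ordering the second field from largest to smallest.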

MyCombinerClass (the combiner implementation; it must extend Reducer and acts as a local aggregation step on the map side):

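    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    import java.io.IOException;

    public class MyCombinerClass extends Reducer<K2Bean, NullWritable, K2Bean, NullWritable> {
        // A combiner must not change the final result. In general it reduces the amount
        // of data sent to the reducers, which makes it a tuning tool.
        @Override
        protected void reduce(K2Bean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            // Emit the key once per value so duplicate records are kept;
            // an empty method body would silently drop every record.
            for (NullWritable value : values) {
                context.write(key, NullWritable.get());
            }
        }
    }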
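The benefit of a combiner is easier to see on a job that aggregates, such as counting. The following plain-Java sketch simulates two map tasks whose outputs are summed locally before the reduce step. It is an illustration of the idea only: the class CombinerSketch, the word data, and the counting job are assumptions and are unrelated to the sort example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
    // Combiner step: sum the occurrences of each word within one map task's output.
    static Map<String, Integer> combine(List<String> mapOutputWords) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : mapOutputWords) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    // Reduce step: merge the partial sums coming from every map task.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> mapTask1 = Arrays.asList("a", "b", "a", "a");
        List<String> mapTask2 = Arrays.asList("b", "a");

        Map<String, Integer> partial1 = combine(mapTask1); // 4 records shrink to 2
        Map<String, Integer> partial2 = combine(mapTask2);

        System.out.println(reduce(Arrays.asList(partial1, partial2)));
    }
}
```

The totals are the same as they would be without the combine step; only the number of records passed to the reduce step is smaller.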

Counters in MapReduce

Built-in counters in Hadoop:

MapReduce task counters
File system counters
FileInputFormat counters
FileOutputFormat counters
Job counters

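Enum-based counters:

    public static enum Counter {
        REDUCE_INPUT_RECORD,
        REDUCE_OUTPUT_RECORD
    }

    // Inside a Reducer such as SortReducer
    @Override
    protected void reduce(K2Bean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // Count the key groups received by the reducer
        org.apache.hadoop.mapreduce.Counter inputCounter = context.getCounter(Counter.REDUCE_INPUT_RECORD);
        inputCounter.increment(1L);
        for (NullWritable value : values) {
            // Count the records written by the reducer
            org.apache.hadoop.mapreduce.Counter outputCounter = context.getCounter(Counter.REDUCE_OUTPUT_RECORD);
            outputCounter.increment(1L);
            context.write(key, NullWritable.get());
        }
    }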

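String-based counters:

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // The counter is identified by a group name and a counter name
        Counter counter = context.getCounter("MR_INPUT_COUNT", "MAP_TOTAL_RESULT");
        counter.increment(1L);
    }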
