Big Data Learning (3): Sorting Data from Multiple Files with MapReduce

6点A君

Category: Hadoop. Tags: Hadoop, MapReduce

Article link: https://blog.csdn.net/anLA_/article/details/88747102

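A brief aside first

Global data in a MapReduce job

How can global data be kept in a MapReduce job? A few options are worth considering:

Read and write an HDFS file, i.e. keep the variable in one shared place
Set Job properties, i.e. write the variable into the configuration (Configuration)
Use DistributedCache, although DistributedCache is read-only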

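Sorting

First, recall the MapReduce process: Map takes the input and emits intermediate output, and Reduce then processes those results and computes the final output.
MapReduce already sorts as part of this process. By default it sorts by key: if the key is an IntWritable wrapping an int, keys are ordered numerically; if the key is a string (Text), they are ordered byte by byte, which for ASCII text means ASCII order.
There is a catch, though. Each Reducer only sorts the data that was sent to it, so the default sort cannot guarantee a global order: a partition step runs before the sort, which means the order holds inside each reducer but not across reducers.
To get order across reducers as well, you need a custom Partitioner class that ensures the data on all Reducers is ordered as a whole once partitioning is done.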
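To see why the default partitioning breaks the global order, here is a minimal plain-Java sketch that needs no Hadoop on the classpath. It assumes the default hash partitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numPartitions` and that an integer key hashes to its own value; the class name `HashPartitionOrderSketch`, the method names, and the sample keys are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simulates hash partitioning followed by a per-reducer sort (no Hadoop needed).
class HashPartitionOrderSketch {

    // Assumed formula of the default hash partitioner; an int key hashes to itself.
    static int hashPartition(int key, int numPartitions) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numPartitions;
    }

    // Sends every key to its partition, sorts each partition on its own,
    // then concatenates the partitions in order 0, 1, 2, ...
    static List<Integer> simulate(int[] keys, int numPartitions) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int key : keys) {
            partitions.get(hashPartition(key, numPartitions)).add(key);
        }
        List<Integer> combined = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            Collections.sort(partition);   // each reducer sorts only its own keys
            combined.addAll(partition);
        }
        return combined;
    }

    public static void main(String[] args) {
        int[] keys = {5, 1, 9, 3, 7, 2};         // made-up sample keys
        System.out.println(simulate(keys, 2));   // prints [2, 1, 3, 5, 7, 9]
    }
}
```

Each partition is sorted internally ([2] and [1, 3, 5, 7, 9]), yet reading the outputs one after another does not give a sorted sequence, which is the problem the custom Partitioner is meant to solve.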
The code follows this approach:

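Convert each value read from the input to an IntWritable and emit it as the key (the value can be anything).
Override the partitioner to guarantee overall order: divide the maximum input value a by the number of partitions b and use the quotient c as the boundary step, so the partition boundaries fall at 1x, 2x, ..., (b-1)x of c. After partitioning, the data is then ordered across partitions as a whole.
Once Reduce receives <key, value-list>, it writes the input key as the output value once for every element in value-list (so duplicates are written multiple times). The output key is a global variable that tracks the rank of the current key, i.e. its position among all the numbers.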
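The boundary arithmetic in the second step is easy to check by hand. The sketch below is plain Java and assumes non-negative keys no larger than the maximum. It uses the maximum value 65223 that is hard-coded in the source code further down, and a step of `max / partitions + 1`, where the `+ 1` keeps the maximum itself inside the last range. The class `RangeBoundarySketch`, its method names, and the choice of 3 partitions are hypothetical.

```java
import java.util.Arrays;

// Works out the range-partition boundaries for a known maximum key.
class RangeBoundarySketch {

    // Step between boundaries: floor(maxValue / numPartitions) + 1.
    static int step(int maxValue, int numPartitions) {
        return maxValue / numPartitions + 1;
    }

    // Boundaries sit at step, 2 * step, ..., (numPartitions - 1) * step.
    static int[] boundaries(int maxValue, int numPartitions) {
        int c = step(maxValue, numPartitions);
        int[] result = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            result[i - 1] = c * i;
        }
        return result;
    }

    // Partition i holds keys in [step * i, step * (i + 1)); valid for 0 <= key <= maxValue.
    static int partitionOf(int key, int maxValue, int numPartitions) {
        return key / step(maxValue, numPartitions);
    }

    public static void main(String[] args) {
        int max = 65223;      // maximum value taken from the source listing
        int partitions = 3;   // assumed number of reducers
        System.out.println(Arrays.toString(boundaries(max, partitions)));   // prints [21742, 43484]
        System.out.println(partitionOf(21741, max, partitions));            // prints 0
        System.out.println(partitionOf(21742, max, partitions));            // prints 1
        System.out.println(partitionOf(65223, max, partitions));            // prints 2
    }
}
```

Because every key in partition 0 is smaller than every key in partition 1, and so on, concatenating the reducer outputs in partition order yields a globally sorted sequence.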

Sample input

file01:

file02:

file03:

Sample output:
[image]

Source code

Finally, here is the source code, to make things easier to follow:

package com.anla.chapter2;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;

/**
 * @user anLA7856
 * @time 19-3-22 4:38 PM
 * @description Sorts the integers read from multiple input files with MapReduce.
 */
public class Sort {

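    public static class Map extends Mapper<Object, Text, IntWritable, IntWritable> {
        private final IntWritable data = new IntWritable();
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString().trim();    // one integer per line
            if (line.isEmpty()) {
                return;    // skip blank lines instead of failing in parseInt
            }
            data.set(Integer.parseInt(line));
            context.write(data, ONE);    // the number is the key; the value is only a placeholder
        }
    }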

    public static class Reduce extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        // Running rank counter. Each reduce task has its own copy, so the
        // ranks are global only when the job runs with a single reducer.
        private final IntWritable lineNum = new IntWritable(1);

        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            for (IntWritable val : values) {
                context.write(lineNum, key);       // output key is the rank, output value is the number
                lineNum.set(lineNum.get() + 1);    // advance the rank once per occurrence
            }
        }
    }

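    public static class Partition extends Partitioner<IntWritable, IntWritable> {

        @Override
        public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
            int maxNumber = 65223;    // assumed upper bound of the input values
            int bound = maxNumber / numPartitions + 1;    // width of each key range
            int keyNumber = key.get();
            // Partition i receives the keys in [bound * i, bound * (i + 1)).
            for (int i = 0; i < numPartitions; i++) {
                if (keyNumber >= bound * i && keyNumber < bound * (i + 1)) {
                    return i;
                }
            }
            // Keys outside [0, maxNumber] go to the first or last partition so the order still holds.
            return keyNumber < 0 ? 0 : numPartitions - 1;
        }
    }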

    public static void main(String[] args) throws Exception{
        Configuration configuration = new Configuration();
        String[] otherArgs = new GenericOptionsParser(configuration, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.out.println("Usage: Sort <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(configuration, "sort");
        job.setJarByClass(Sort.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setPartitionerClass(Partition.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0: 1);
    }
}
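To make the numbering done in the Reduce class concrete, here is a small plain-Java sketch of what a single reducer would produce. It is only a simulation: a `TreeMap` stands in for the sort-and-group step of the shuffle, the input arrays stand in for input files, and the values are made up. `RankSketch` and `rank` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulates the rank numbering of a single reducer (no Hadoop needed).
class RankSketch {

    // Returns one "rank<TAB>value" line per input number, in ascending order.
    static List<String> rank(int[]... files) {
        // key -> number of occurrences; TreeMap iterates keys in ascending order,
        // which mimics the sorted <key, value-list> pairs a reducer receives.
        TreeMap<Integer, Integer> grouped = new TreeMap<>();
        for (int[] file : files) {
            for (int value : file) {
                grouped.merge(value, 1, Integer::sum);
            }
        }
        List<String> lines = new ArrayList<>();
        int lineNum = 1;   // running rank, like the counter in Reduce
        for (Map.Entry<Integer, Integer> entry : grouped.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {   // duplicates are written once each
                lines.add(lineNum + "\t" + entry.getKey());
                lineNum++;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Two made-up "files"; the value 3 appears twice.
        for (String line : rank(new int[]{3, 1, 3}, new int[]{2})) {
            System.out.println(line);
        }
    }
}
```

With these inputs the sketch writes ranks 1 to 4 for the values 1, 2, 3, 3. On a cluster, each reduce task normally keeps its own counter, so with more than one reducer the ranks would start again from 1 in every output file even though the values themselves stay ordered across files.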

For how to run it, see the author's previous post:
Learning Big Data with A君 (2): Running Hadoop's WordCount Program Step by Step

References:

Hadoop in Action
