MapReduce Notes

SONGZ_

Article link: https://blog.csdn.net/TrUthSong/article/details/105055255

MapReduce programming model

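1. InputFormat: reads the input files and divides them into splits.
2. Map phase: splits each record on the specified delimiter and emits key/value pairs in the required format.
3. Shuffle phase: merges the data by key, so that all values with the same key are brought together.
4. Reduce phase: applies the business logic, such as a sum or an average.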
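
The four stages above can be tried out without a cluster. The following is a minimal in-memory sketch in plain Java, not Hadoop code: the class `WordCountSimulation` and its methods are hypothetical names introduced for illustration, and the shuffle is reduced to a sorted map that groups values by key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSimulation {

    // Map step: split one line on the delimiter and emit (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\t")) {
            out.add(new AbstractMap.SimpleEntry<>(word.toLowerCase(), 1));
        }
        return out;
    }

    // Shuffle step: bring all values of the same key together (TreeMap keeps keys sorted).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map, shuffle, and reduce in sequence over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello\tworld", "hello\tworld", "hello\tworld\twelcome");
        System.out.println(run(lines)); // prints {hello=3, welcome=1, world=3}
    }
}
```

In a real job the three steps run in different processes on different machines; here they are plain method calls, which is enough to see how the data changes shape at each stage.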

Map phase: map task

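import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable,Text,Text,IntWritable>{

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Split the line held in value on the specified delimiter
        String[] words = value.toString().split("\t");

        for(String word : words) {
            // Emit one pair per word: (hello,1)  (world,1)
            context.write(new Text(word.toLowerCase()), new IntWritable(1));
        }
    }
}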

Reduce phase: reduce task

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text,IntWritable, Text,IntWritable>{

    /**
     * (hello,1)  (world,1)
     * (hello,1)  (world,1)
     * (hello,1)  (world,1)
     * (welcome,1)
     *
     * Map output is distributed so that all pairs with the same key
     * go to the same reduce call
     *
     * reduce1: (hello,1)(hello,1)(hello,1)  ==> (hello, <1,1,1>)
     * reduce2: (world,1)(world,1)(world,1)   ==> (world, <1,1,1>)
     * reduce3: (welcome,1)  ==> (welcome, <1>)
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        int count = 0;

        // Sum the values for this key, e.g. <1,1,1>
        for (IntWritable value : values) {
            count += value.get();
        }

        context.write(key, new IntWritable(count));
    }
}

MapReduce execution flow


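1. Files on HDFS or the local file system are read through the InputFormat.
2. The input is divided into splits (by default, one split per block).
3. RecordReader: the data reader, which turns each split into key/value records.
4. Map phase.
5. The map output can optionally pass through a combiner, which runs the same business logic as the reduce ahead of time inside the map task.
6. Shuffle phase.
7. Reduce phase.
8. The OutputFormat writes the results.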

combiner

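A combiner can be viewed as a local reducer: it merges the values that share a key within the output of a single map task.
When the reduce logic is commutative and associative (a sum, for example), the Reducer class itself can serve as the combiner, so no separate combiner logic needs to be written.
To enable it, configure the job in the driver: job.setCombinerClass(WordCountReducer.class);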

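Advantages:
It reduces the amount of data a map task outputs. It can be applied to both the spill files and the merged file, so both get smaller.
It reduces the network I/O between the map side and the reduce side.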

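When to use it:
Not every job can use a combiner. It generally suits operations such as summation, but not averaging, because the average of partial averages is not, in general, the overall average.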
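
The difference between summing and averaging can be checked with a few numbers. The sketch below is plain Java rather than Hadoop code; the class `CombinerSketch` and the sample values are made up for illustration. Each inner list stands for the values of one key produced by one map task. The last method shows one common workaround for averages: carry (sum, count) forward and divide only at the end.

```java
import java.util.Arrays;
import java.util.List;

class CombinerSketch {

    static long sum(List<Integer> values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Safe to combine: the sum of per-task sums equals the overall sum.
    static long sumOfPartialSums(List<List<Integer>> perMapTask) {
        long total = 0;
        for (List<Integer> part : perMapTask) {
            total += sum(part);
        }
        return total;
    }

    // Not safe to combine: averaging per-task averages weights every task equally.
    static double averageOfPartialAverages(List<List<Integer>> perMapTask) {
        double total = 0;
        for (List<Integer> part : perMapTask) {
            total += (double) sum(part) / part.size();
        }
        return total / perMapTask.size();
    }

    // Workaround: combine (sum, count) per task and divide once at the end.
    static double averageFromSumAndCount(List<List<Integer>> perMapTask) {
        long totalSum = 0;
        long totalCount = 0;
        for (List<Integer> part : perMapTask) {
            totalSum += sum(part);
            totalCount += part.size();
        }
        return (double) totalSum / totalCount;
    }

    public static void main(String[] args) {
        List<List<Integer>> perMapTask = Arrays.asList(Arrays.asList(1, 2, 3), Arrays.asList(10));
        System.out.println(sumOfPartialSums(perMapTask));         // prints 16
        System.out.println(averageOfPartialAverages(perMapTask)); // prints 6.0 (wrong)
        System.out.println(averageFromSumAndCount(perMapTask));   // prints 4.0 (correct)
    }
}
```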

partitioner

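The Partitioner runs at the end of the map phase. After the Mapper has processed its data, each output record passes through the Partitioner, which decides which Reducer will handle it, so that the Mapper output is spread across the Reducers (evenly, when the keys hash evenly).
Every key/value pair emitted by a map is assigned a partition. By default, partition = hash(key) mod R, where R is the number of reduce tasks. Partitions are numbered from 0 to R-1, so a pair whose partition is 1 is handled by the reduce task with index 1, that is, the second one.
A custom partitioner can also be used to control which reduce task, and therefore which output file, each record goes to, which makes it possible to group the output by category.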

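import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Access is a custom value type that is not shown in this article
public class MyPartitioner extends Partitioner<Text, Access>{
    @Override
    public int getPartition(Text name, Access access, int numReduceTasks) {

        // Route records by the first letter of the key
        if(name.toString().startsWith("A")) {
            return 0;
        } else if(name.toString().startsWith("B")) {
            return 1;
        } else {
            return 2;
        }
    }
}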

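To enable it: job.setPartitionerClass(MyPartitioner.class); Because this partitioner returns 0, 1, or 2, the job also needs at least three reduce tasks, for example job.setNumReduceTasks(3);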
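
For comparison with the custom partitioner, the default rule partition = hash(key) mod R can be sketched in plain Java. The class `HashPartitionSketch` is a hypothetical name; the expression follows the one used by Hadoop's `HashPartitioner`, where the bit mask clears the sign bit so that a negative hash code cannot produce a negative partition. Note that this sketch hashes `String` keys, whose hash codes differ from those of Hadoop's `Text`, so the concrete partition numbers are only illustrative.

```java
class HashPartitionSketch {

    // Maps a key to a partition in the range [0, numReduceTasks).
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String key : new String[] {"hello", "world", "welcome"}) {
            System.out.println(key + " -> partition " + partitionFor(key, reducers));
        }
    }
}
```

Equal keys always land in the same partition, which is what guarantees that one reduce task sees every value for a given key.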

shuffle

map-shuffle

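spill: writing to disk. Each map task stores its output in a circular in-memory buffer (100 MB by default). When the buffer is 80% full, a spill is triggered.
Inside the buffer, every record is assigned to a partition and tagged with it (partition: hash(key) mod R).
Each time the buffer reaches the 80% threshold, its contents are sorted by partition and then by key and written to disk as a small spill file, so one map task can produce several of them.
merge: at the end, the small spill files are merged and sorted, partition by partition, into one large file.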
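
The spill and merge steps can be imitated in a few lines of plain Java. This is a simplified sketch under stated assumptions, not Hadoop's implementation: the buffer counts records instead of bytes, spilling happens synchronously rather than in a background thread, spill "files" are in-memory lists, and the final merge simply re-sorts the concatenated runs. The class `SpillSketch` and all of its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SpillSketch {

    static final class Record {
        final int partition;
        final String key;
        final int value;

        Record(int partition, String key, int value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    // Sort order used for every spill: partition first, then key.
    static final Comparator<Record> ORDER =
            Comparator.comparingInt((Record r) -> r.partition)
                      .thenComparing((Record r) -> r.key);

    final int spillThreshold;   // number of records that triggers a spill
    final int numPartitions;    // R, the number of reduce tasks
    final List<Record> buffer = new ArrayList<>();
    final List<List<Record>> spills = new ArrayList<>();

    SpillSketch(int capacity, int spillPercent, int numPartitions) {
        this.spillThreshold = capacity * spillPercent / 100;
        this.numPartitions = numPartitions;
    }

    // Tag the record with its partition, buffer it, and spill at the threshold.
    void collect(String key, int value) {
        int partition = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        buffer.add(new Record(partition, key, value));
        if (buffer.size() >= spillThreshold) {
            spill();
        }
    }

    // Sort the buffered records and set them aside as one spill run.
    void spill() {
        if (buffer.isEmpty()) {
            return;
        }
        List<Record> run = new ArrayList<>(buffer);
        run.sort(ORDER);
        spills.add(run);
        buffer.clear();
    }

    // Flush what is left, then merge all runs into one sorted output.
    List<Record> finish() {
        spill();
        List<Record> merged = new ArrayList<>();
        for (List<Record> run : spills) {
            merged.addAll(run);
        }
        merged.sort(ORDER); // stands in for a streaming merge of sorted runs
        return merged;
    }
}
```

With a capacity of 10 records and a spill percentage of 80, the threshold is 8 records, so collecting 20 records produces spills after the 8th and 16th record plus a final spill of the remaining 4 when `finish()` is called.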

The map task communicates with the ApplicationMaster, and the ApplicationMaster notifies the reduce tasks to pull the data.

reduce-shuffle

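Several threads are started to pull the data that belongs to this reduce task from the different machines.
The fetched files are merged and sorted.
The values that share a key are grouped together.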
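
The merge-and-group step on the reduce side can be sketched as a k-way merge over segments that are already sorted by key. This is plain Java, not Hadoop code: the class `ReduceMergeSketch` and its methods are hypothetical, the fetched segments are in-memory lists, and a priority queue stands in for the on-disk merge.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class ReduceMergeSketch {

    // Convenience factory for one (key, value) record.
    static Map.Entry<String, Integer> entry(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Merges key-sorted segments and groups the values of equal keys.
    static Map<String, List<Integer>> mergeAndGroup(List<List<Map.Entry<String, Integer>>> segments) {
        // Each heap element is {segment index, position inside that segment},
        // ordered by the key of the record it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> segments.get(a[0]).get(a[1]).getKey()
                        .compareTo(segments.get(b[0]).get(b[1]).getKey()));
        for (int s = 0; s < segments.size(); s++) {
            if (!segments.get(s).isEmpty()) {
                heap.add(new int[] {s, 0});
            }
        }

        // Records leave the heap in key order, so insertion order is sorted order.
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            Map.Entry<String, Integer> record = segments.get(top[0]).get(top[1]);
            grouped.computeIfAbsent(record.getKey(), k -> new ArrayList<>()).add(record.getValue());
            if (top[1] + 1 < segments.get(top[0]).size()) {
                heap.add(new int[] {top[0], top[1] + 1});
            }
        }
        return grouped;
    }
}
```

Each entry of the returned map corresponds to one call of reduce(key, values) in the WordCountReducer shown earlier.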

Join

reduce join

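import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceJoin {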

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);

        job.setJarByClass(ReduceJoin.class);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReduce.class);

        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Join.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path("join/input"));
        FileOutputFormat.setOutputPath(job,new Path("join/output"));

        job.waitForCompletion(true);

    }

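    // Sample dept row (3 tab-separated fields):
    //10	ACCOUNTING	NEW YORK
    // Sample emp row (8 tab-separated fields, deptno last):
    //7499	ALLEN	SALESMAN	7698	1981-2-20	1600.00	300.00	30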
    public static class MyMapper extends Mapper<LongWritable, Text, IntWritable, Join> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] line = value.toString().split("\t");

            if (line.length == 3) {
                int deptno = Integer.parseInt(line[0]);
                String dept_name = line[1];

                context.write(new IntWritable(deptno), new Join(dept_name, "d"));

            } else if (line.length == 8) {
                int deptno = Integer.parseInt(line[7]);
                String empno = line[0];
                String ename = line[1];
                String sal = line[5];
                StringBuilder sb = new StringBuilder();
                sb.append(empno).append("\t").append(ename).append("\t").append(sal);

                context.write(new IntWritable(deptno), new Join(sb.toString(), "e"));
            }

        }
    }

    public static class MyReduce extends Reducer<IntWritable, Join, Text, NullWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<Join> values, Context context) throws IOException, InterruptedException {

            List<String> dept = new ArrayList<>();
            List<String> emp = new ArrayList<>();


            for (Join join : values) {
                if ("e".equals(join.getFlag())) {
                    emp.add(join.getData());
                } else if ("d".equals(join.getFlag())) {
                    dept.add(join.getData());
                }
            }


            int i, j;

            for (i = 0; i < emp.size(); i++) {
                for (j = 0; j < dept.size(); j++) {
                    context.write(new Text(emp.get(i) + "\t" + dept.get(j)), NullWritable.get());
                }
            }
        }
    }

}

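The map side reads both data sets (A join B). In the map phase, each record is tagged with a flag that identifies which data set it came from, and the join field is used as the key.
In the reduce phase, the flag is used to tell the two data sets apart, and the records are combined according to the join logic.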
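
The `Join` value type used by the mapper and reducer above is not shown in the article. The following is a hypothetical reconstruction, inferred only from how the code calls it (a two-argument constructor plus `getData()` and `getFlag()`). To keep the sketch runnable without Hadoop on the classpath, the class does not declare the interface; in a real job it would add `implements org.apache.hadoop.io.Writable`, whose two methods have the signatures used here. The `roundTrip` helper is also an illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical value type: the record payload plus a flag naming its source
// ("d" for dept, "e" for emp).
class Join {
    private String data;
    private String flag;

    // Hadoop creates value objects by reflection, so a no-arg constructor is needed.
    public Join() {
    }

    public Join(String data, String flag) {
        this.data = data;
        this.flag = flag;
    }

    public String getData() {
        return data;
    }

    public String getFlag() {
        return flag;
    }

    // Serialization: fields must be written and read in the same order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(data);
        out.writeUTF(flag);
    }

    public void readFields(DataInput in) throws IOException {
        data = in.readUTF();
        flag = in.readUTF();
    }

    // Serializes and deserializes a value, as happens between map and reduce.
    static Join roundTrip(Join original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            Join copy = new Join();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The reducer above copies `getData()` into its lists as a `String` instead of keeping the `Join` objects themselves. That matters because Hadoop reuses the same value instance while iterating, so stored references would all end up pointing at the last record.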

map join

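import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapJoinApp {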

    public static void main(String[] args)throws Exception {

        Configuration configuration = new Configuration();

        Job job = Job.getInstance(configuration);
        job.setJarByClass(MapJoinApp.class);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(0);  // map-only job: no reduce phase

        // Add the small file to the distributed cache
        job.addCacheFile(new URI("join/input/dept.txt"));
        FileInputFormat.setInputPaths(job, new Path("input/join/input/emp.txt"));

        Path outputDir = new Path("input/join/mapoutput");
        outputDir.getFileSystem(configuration).delete(outputDir,true);
        FileOutputFormat.setOutputPath(job, outputDir);

        job.waitForCompletion(true);
    }


    public static class MyMapper extends Mapper<LongWritable,Text, Text, NullWritable> {

        private static Map<Integer,String> cache = new HashMap<>();

        // Runs once per map task, before any call to map(): load the small table into memory
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            String path = context.getCacheFiles()[0].toString();
            // try-with-resources closes the reader even if parsing fails
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                String readLine;
                while ((readLine = reader.readLine()) != null) {
                    String[] splits = readLine.split("\t");  // dept
                    int deptno = Integer.parseInt(splits[0]);
                    String dname = splits[1];
                    cache.put(deptno, dname);
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            String[] splits = value.toString().split("\t");
            int length = splits.length;

            StringBuilder builder = new StringBuilder();

            if (length == 8) {  //emp
                String empno = splits[0];
                String ename = splits[1];
                String sal = splits[5];
                int deptno = Integer.parseInt(splits[7]);

                String dname = cache.get(deptno);

                builder.append(empno).append("\t")
                        .append(ename).append("\t")
                        .append(sal).append("\t")
                        .append(dname);

                context.write(new Text(builder.toString()), NullWritable.get());
            }
        }
    }
}

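The smaller file is loaded into the distributed cache, so the data can be joined on the map side.

Notes:
This only works when one side of the join is small enough to be held in memory.
A map join does not need a reduce phase, so set the number of reduce tasks to 0 with job.setNumReduceTasks(0);. In testing, the job still went through a reduce step when this was not set, because the default is one reduce task.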
