MapReduce in Hadoop

leiline

Article link: https://blog.csdn.net/leiline/article/details/70229106

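What is MapReduce?

MapReduce is a popular distributed computing framework designed for processing massive data sets in parallel. Its core consists of two parts: Map and Reduce. When a job is submitted, the framework first splits it into a number of Map tasks and assigns them to different nodes for execution. Each Map task processes one portion of the input data and, when it finishes, produces intermediate files that serve as the input to the Reduce tasks. The main goal of a Reduce task is to aggregate the output of the preceding Map tasks and write out the result.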
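
The flow described above can be sketched in plain Java without any Hadoop dependency. The class name MiniMapReduce and the "year,temperature" record format are assumptions made purely for illustration; a real job runs the map and reduce steps on different nodes and exchanges intermediate files rather than in-memory lists.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MiniMapReduce {

    // Map step: turn one "year,temperature" record into a (year, temperature) pair.
    static Map.Entry<String, Integer> map(String record) {
        String[] fields = record.split(",");
        return new AbstractMap.SimpleEntry<>(fields[0], Integer.parseInt(fields[1].trim()));
    }

    // Reduce step: collapse all temperatures recorded for one year into their minimum.
    static int reduce(List<Integer> values) {
        int min = Integer.MAX_VALUE;
        for (int v : values) {
            min = Math.min(min, v);
        }
        return min;
    }

    // Runs map over every split, groups the intermediate pairs by key, then reduces each group.
    static Map<String, Integer> run(List<List<String>> splits) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<String> split : splits) {            // each split stands for one map task
            for (String record : split) {
                Map.Entry<String, Integer> pair = map(record);
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("1949,111", "1950,22"),
                Arrays.asList("1949,78", "1950,-11", "1950,0"));
        System.out.println(run(splits));   // prints {1949=78, 1950=-11}
    }
}
```

The grouping step in run() stands in for the shuffle: it is the only place where data from different map tasks meets.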

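The Map Phase

Each input split is processed by one map task. By default, a split is the size of one HDFS block (64 MB in Hadoop 1.x; the default block size is 128 MB from Hadoop 2.x onward). The output of the map function is held temporarily in a circular in-memory buffer (100 MB by default). When the buffer is close to overflowing, that is, when its fill level crosses a threshold (80% by default), a spill file is created on the local disk and the buffered data is written to it. Before being written, the data is divided into partitions, one per reduce task, and sorted by key within each partition.

By the time the map task emits its last record there may be many spill files, which must then be merged into a single partitioned, sorted output file.

The data in each partition is then copied to the corresponding reduce task.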
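
As a rough illustration of the two ideas in this paragraph, the sketch below counts how many splits a file yields and simulates a buffer that spills sorted runs once it fills past a threshold. The class SpillSketch and its methods are hypothetical, and the model is simplified: it counts records instead of bytes, spills synchronously, and ignores the extra rules a real input format applies when computing splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SpillSketch {

    // Number of map tasks when every split is one block: ceil(fileSize / splitSize).
    static long splitCount(long fileSizeBytes, long splitSizeBytes) {
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    // Buffers records and "spills" a sorted run each time the buffer reaches the threshold.
    static List<List<String>> collect(List<String> mapOutput, int capacity, double spillPercent) {
        int threshold = Math.max(1, (int) (capacity * spillPercent));
        List<List<String>> spills = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (String record : mapOutput) {
            buffer.add(record);
            if (buffer.size() >= threshold) {
                spills.add(sortedCopy(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {               // final flush when the map task ends
            spills.add(sortedCopy(buffer));
        }
        return spills;
    }

    private static List<String> sortedCopy(List<String> records) {
        List<String> run = new ArrayList<>(records);
        Collections.sort(run);
        return run;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(splitCount(200 * mb, 64 * mb));                        // prints 4
        System.out.println(collect(Arrays.asList("d", "b", "a", "c", "e"), 5, 0.8)); // prints [[a, b, c, d], [e]]
    }
}
```

Each inner list plays the role of one spill file: it is sorted on its own, but the runs are not sorted relative to each other.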
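
Because every spill file is already sorted, merging them is a k-way merge rather than a full re-sort. The following is a minimal sketch of that technique using a priority queue; SpillMerge and Cursor are hypothetical names, and in-memory lists stand in for the spill files on disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

final class SpillMerge {

    // Cursor over one sorted run: the run itself plus the position of its next unread record.
    private static final class Cursor {
        final List<String> run;
        int pos;

        Cursor(List<String> run) {
            this.run = run;
        }

        String current() {
            return run.get(pos);
        }
    }

    // Merges any number of individually sorted runs into one sorted list.
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.current().compareTo(b.current()));
        for (List<String> run : sortedRuns) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();        // run whose next record is the smallest overall
            merged.add(smallest.current());
            smallest.pos++;
            if (smallest.pos < smallest.run.size()) {
                heap.add(smallest);               // re-insert with its new head record
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList(
                Arrays.asList("1949", "1951"),
                Arrays.asList("1950", "1952"),
                Arrays.asList("1949"))));         // prints [1949, 1949, 1950, 1951, 1952]
    }
}
```

Only the head record of each run is compared at any time, which is why the merge can stream from disk instead of loading every file.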
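
Which reduce task a record goes to is decided by its key. The sketch below uses the same arithmetic as Hadoop's default hash partitioner (hash code masked to a non-negative value, modulo the number of reduce tasks), but applied to plain Java strings; PartitionSketch and its methods are hypothetical names, and the partition numbers it yields will not necessarily match those of a real job, which hashes Text keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PartitionSketch {

    // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Distributes keys into one bucket per reduce task.
    static List<List<String>> route(List<String> keys, int numReduceTasks) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, numReduceTasks)).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(route(Arrays.asList("1949", "1950", "1949", "1951"), 2));
    }
}
```

The property that matters is that equal keys always land in the same partition, so a single reduce task sees every value for a given key.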

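The Reduce Phase

A reduce task receives data from many different map tasks, and the data from each map task is already sorted.

The reduce task merges these sorted inputs. Merging produces many intermediate files that are written to disk, but MapReduce keeps the amount of data written to disk as small as possible; the result of the final merge is not written to disk at all but is fed directly to the reduce function.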
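
Feeding the merged stream straight into the reduce function works because the stream is sorted by key: all values for one key arrive consecutively, so each group can be reduced and emitted as soon as the key changes. The sketch below illustrates this with a minimum-value reduction; ReduceSideSketch, pair() and reduceSorted() are hypothetical names, and a list stands in for the merged stream.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReduceSideSketch {

    // Convenience factory for a (key, value) pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Walks a key-sorted stream once and reduces each run of equal keys to its minimum value.
    static Map<String, Integer> reduceSorted(List<Map.Entry<String, Integer>> sortedPairs) {
        Map<String, Integer> output = new LinkedHashMap<>();
        String currentKey = null;
        int min = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> p : sortedPairs) {
            if (currentKey != null && !currentKey.equals(p.getKey())) {
                output.put(currentKey, min);      // key changed: the previous group is complete
                min = Integer.MAX_VALUE;
            }
            currentKey = p.getKey();
            min = Math.min(min, p.getValue());
        }
        if (currentKey != null) {
            output.put(currentKey, min);          // emit the last group
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(reduceSorted(Arrays.asList(
                pair("1949", 111), pair("1949", 78),
                pair("1950", 22), pair("1950", -11), pair("1950", 0))));
        // prints {1949=78, 1950=-11}
    }
}
```

Only one running minimum is kept in memory at a time, regardless of how many records the stream contains.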

How MapReduce Works, Illustrated with Code

Configuring the Job

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MinTemperature {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MinTemperature <input path> <output path>");
            System.exit(-1);
        }

        // new Job() is deprecated since Hadoop 2.x; Job.getInstance() is its replacement.
        Job job = Job.getInstance();
        job.setJarByClass(MinTemperature.class);
        job.setJobName("Min temperature");

        // The output directory must not exist before the job runs.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MinTemperatureMapper.class);
        job.setReducerClass(MinTemperatureReducer.class);

        // Key and value types written by the reducer (and, by default, by the mapper).
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Block until the job finishes; exit with 0 on success and 1 on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

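The Job object specifies how the job is to be run.

Rather than naming the JAR file explicitly, you can pass a class to the Job object's setJarByClass() method; Hadoop then locates the JAR file that contains that class.

After constructing the Job object, specify the input and output paths. The static method FileInputFormat.addInputPath() defines an input path, and the static method FileOutputFormat.setOutputPath() defines the output path, of which there can be only one. The output path must not exist before the job runs; otherwise Hadoop reports an error and refuses to run the job.

setMapperClass() and setReducerClass() specify the map and reduce classes. setOutputKeyClass() and setOutputValueClass() control the output types of the map and reduce functions, which are usually the same. When the map output types differ from the final output types, they are set separately with setMapOutputKeyClass() and setMapOutputValueClass().

waitForCompletion() returns a boolean indicating whether the job succeeded or failed; the code above converts it into the exit code 0 or 1.

The Map Phase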

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

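public class MinTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Value that marks a missing temperature reading in the input records.
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Each line is a fixed-width record; the year is read from a fixed column range.
        String line = value.toString();
        String year = line.substring(15, 19);

        // Skip an explicit leading '+', which Integer.parseInt rejects on older Java versions.
        int airTemperature;
        if (line.charAt(87) == '+') {
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }

        // Emit (year, temperature) only for readings that are present and have an accepted quality code.
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}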

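Mapper is a generic type with four type parameters, which specify the input key, input value, output key, and output value types of the map function.

The map() method takes a key and a value as input, and the Context instance is used to write the output. With the default text input format, the key is the byte offset of the line within the file and the value is the line itself.

The Reduce Phase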
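
The parsing logic inside map() can be exercised without a cluster by pulling it into plain Java. The sketch below mirrors the column offsets and the quality check of the mapper above; RecordParser, sampleLine() and the synthetic records it builds are hypothetical and exist only to make the example runnable, they are not real data.

```java
import java.util.Arrays;

final class RecordParser {

    static final int MISSING = 9999;

    // Year occupies character positions 15 to 18 of the record.
    static String year(String line) {
        return line.substring(15, 19);
    }

    // Returns the temperature, or null when the reading is missing or fails the quality check.
    static Integer validTemperature(String line) {
        int start = line.charAt(87) == '+' ? 88 : 87;   // skip an explicit plus sign
        int temperature = Integer.parseInt(line.substring(start, 92));
        String quality = line.substring(92, 93);
        boolean valid = temperature != MISSING && quality.matches("[01459]");
        return valid ? Integer.valueOf(temperature) : null;
    }

    // Builds a synthetic 93-character record with the given fields at the expected offsets.
    static String sampleLine(String year, String signedTemp, char quality) {
        char[] chars = new char[93];
        Arrays.fill(chars, '0');
        year.getChars(0, 4, chars, 15);
        signedTemp.getChars(0, 5, chars, 87);
        chars[92] = quality;
        return new String(chars);
    }

    public static void main(String[] args) {
        String line = sampleLine("1950", "-0011", '1');
        System.out.println(year(line) + " " + validTemperature(line));          // prints 1950 -11
        System.out.println(validTemperature(sampleLine("1950", "+9999", '1'))); // prints null
    }
}
```

Keeping the parsing in a small helper like this makes the filtering rules easy to unit test before the job is submitted.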

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MinTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        // Scan every temperature recorded for this year and keep the smallest.
        int minValue = Integer.MAX_VALUE;
        for (IntWritable value : values) {
            minValue = Math.min(minValue, value.get());
        }

        // Emit one (year, minimum temperature) pair per key.
        context.write(key, new IntWritable(minValue));
    }
}