Big Data with Hadoop [MapReduce]: Study Notes

Latest recommended article published on 2021-12-20 15:38:10

张章章Sam


Tags: mapreduce, big data, hadoop, data map

Article link: https://blog.csdn.net/qq_16103331/article/details/52927398


Default block.size = 128 MB
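
To make the block size concrete, here is a minimal sketch of how the number of input splits follows from it. It assumes the minimum and maximum split sizes are left at their defaults, so the split size equals the block size, and it mirrors the 10% slack that Hadoop's FileInputFormat allows for the last split. The class and method names (SplitCountSketch, countSplits) are made up for illustration and are not Hadoop APIs.

```java
// Simplified sketch of FileInputFormat-style split counting for one file.
// Assumption: splitSize == blockSize (min/max split size left at defaults).
class SplitCountSketch {
    // A tail that is at most 10% larger than one split is not cut off separately.
    static final double SPLIT_SLOP = 1.1;

    static int countSplits(long fileSize, long splitSize) {
        if (fileSize <= 0 || splitSize <= 0) {
            // Zero-length files are outside the scope of this sketch.
            throw new IllegalArgumentException("sizes must be positive");
        }
        int splits = 0;
        long remaining = fileSize;
        // Cut full-size splits while the remainder is clearly larger than one split.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        // Whatever is left becomes the last split.
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;
        System.out.println(countSplits(128 * mb, blockSize)); // 1
        System.out.println(countSplits(130 * mb, blockSize)); // 1 (tail within the 10% slack)
        System.out.println(countSplits(300 * mb, blockSize)); // 3
    }
}
```

So "one block, one split" is the usual case, but a file that is only slightly larger than a block can still end up as a single split.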
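I. MR execution steps:
(1) Map
1. The InputFormat implementation cuts the input blocks into InputSplit pieces. By default, one block --> 1 split.
2. The RecordReader implementation cuts the data in each split into lines, and each line becomes one <k, v> pair. This k-v pair is the k1, v1 of the Map phase,
where k1 is the offset of the line (the byte position at which the line starts) and v1 is the content of the line. So k1 is a long integer (LongWritable) and v1 is a string (Text).
3. The framework runs the map() method that the user implemented for the business logic; after processing, map() writes out its results. The input to map() here is the <k1, v1> pair produced in the previous step.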
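
The way a line-oriented RecordReader turns a split into (k1, v1) pairs can be sketched in plain Java, without any Hadoop classes. This is only an illustration under simple assumptions: UTF-8 text, "\n" line endings, and the whole split held in memory. LineOffsetSketch, Record and readRecords are hypothetical names, not part of Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch: produce (k1, v1) pairs the way a line-based RecordReader would.
class LineOffsetSketch {
    // One record: k1 = byte offset where the line starts, v1 = line content.
    static final class Record {
        final long offset;
        final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    static List<Record> readRecords(String data) {
        byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
        List<Record> records = new ArrayList<>();
        int start = 0; // byte offset of the line currently being read
        for (int i = 0; i <= bytes.length; i++) {
            boolean atEnd = (i == bytes.length);
            if (atEnd || bytes[i] == '\n') {
                // Emit the line, but skip the empty remainder after a trailing newline.
                if (!atEnd || i > start) {
                    String line = new String(bytes, start, i - start, StandardCharsets.UTF_8);
                    records.add(new Record(start, line));
                }
                start = i + 1; // the next line starts right after the newline
            }
        }
        return records;
    }

    public static void main(String[] args) {
        for (Record r : readRecords("hello you\nhello me\nhello she")) {
            System.out.println("k1=" + r.offset + ", v1=" + r.line);
        }
        // Prints:
        // k1=0, v1=hello you
        // k1=10, v1=hello me
        // k1=19, v1=hello she
    }
}
```

Note that k1 counts bytes, not line numbers: the second line starts at offset 10 because "hello you" is 9 bytes plus one newline byte.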

MR execution depends heavily on disk I/O and network I/O; these two kinds of I/O are the bottleneck that makes MR slow.

II. MR programming
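import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WCDriver {
    public static void main(String[] args) throws Exception {
        // Fail fast unless both the input path and the output path are given.
        if (args == null || args.length < 2) {
            throw new RuntimeException("parameter errors, Usage: <inputPath> <outputPath>");
        }

        String inputPath = args[0];
        String outputPath = args[1];
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, WCDriver.class.getSimpleName());
        // An MR job runs from a jar; this tells Hadoop which jar holds the job classes.
        job.setJarByClass(WCDriver.class);
        // Delete the output directory if it already exists; otherwise the job
        // fails with an "output directory already exists" error.
        FileSystem fs = FileSystem.get(conf);
        fs.delete(new Path(outputPath), true);
        // Configure the map input.
        FileInputFormat.setInputPaths(job, new Path(inputPath));
        job.setMapperClass(WCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setInputFormatClass(TextInputFormat.class);

        // Configure the reducer and the job output.
        FileOutputFormat.setOutputPath(job, new Path(outputPath));
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Set the number of reducers for the job.
        job.setNumReduceTasks(1);
        // Submit the job, wait for it to finish, and report failure through the exit code.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    static class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable k1, Text v1, Context context)
                throws IOException, InterruptedException {
            // v1 is one line of input; emit (word, 1) for every space-separated word.
            String line = v1.toString();
            String[] splits = line.split(" ");
            for (String word : splits) {
                context.write(new Text(word), new LongWritable(1));
            }
        }
    }

    static class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text k2, Iterable<LongWritable> v2s, Context context)
                throws IOException, InterruptedException {
            // Add up the values themselves rather than counting them, so the
            // result stays correct even if the values are partial sums.
            long sum = 0L;
            for (LongWritable lw : v2s) {
                sum += lw.get();
            }
            context.write(k2, new LongWritable(sum));
        }
    }
}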

When writing an MR program, write the Mapper and Reducer first, then fill in the Driver.

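We configured the following in mapred-site.xml:

<property>
    <name>mapreduce.jobhistory.address</name>
    <value>slave01:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>slave01:19888</value>
</property>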

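When viewing the history logs, the following problem appeared:
Aggregation is not enabled. To fix it, add the following to yarn-site.xml:

<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>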

Command to start the jobhistoryserver process:
mr-jobhistory-daemon.sh start historyserver
Also restart YARN (resourcemanager and nodemanager).

Run the jar:

yarn jar jars/mr-test01.jar com.bigdata.mr.WordCountApp /hello /out-1

Log output of the map task:
===sop==map initialization: setup method
===sop==map method input v1: hello you
===sop==map method output k2: hello, v2: 1
===sop==map method output k2: you, v2: 1
===sop==map method input v1: hello me
===sop==map method output k2: hello, v2: 1
===sop==map method output k2: me, v2: 1
===sop==map method input v1: hello she
===sop==map method output k2: hello, v2: 1
===sop==map method output k2: she, v2: 1
===sop==map cleanup method

Log output of the reduce task:
====sop==reducer input: k2: hello, v2s: org.apache.hadoop.mapreduce.task.ReduceContextImpl$ValueIterable@260e86a1
====sop==reducer output: k3: hello, v3: 3
====sop==reducer input: k2: me, v2s: org.apache.hadoop.mapreduce.task.ReduceContextImpl$ValueIterable@260e86a1
====sop==reducer output: k3: me, v3: 1
====sop==reducer input: k2: she, v2s: org.apache.hadoop.mapreduce.task.ReduceContextImpl$ValueIterable@260e86a1
====sop==reducer output: k3: she, v3: 1
====sop==reducer input: k2: you, v2s: org.apache.hadoop.mapreduce.task.ReduceContextImpl$ValueIterable@260e86a1
====sop==reducer output: k3: you, v3: 1
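
The flow behind these logs can be reproduced in memory as a rough sketch: map each line to (word, 1) pairs, group the values by key in sorted key order (a stand-in for the shuffle/sort the framework performs), then sum each group in the reduce step. This runs without Hadoop; WordCountSimulation and its methods are hypothetical names used only for illustration, and the input lines are the three lines seen in the map log.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of map -> group by key -> reduce for word count.
class WordCountSimulation {
    // Map + shuffle: emit (word, 1) per word and collect values per key.
    // A TreeMap keeps keys sorted, similar to how reducers see keys in order.
    static Map<String, List<Long>> mapAndGroup(List<String> lines) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1L);
            }
        }
        return grouped;
    }

    // Reduce: sum the values of each key, (k2, v2s) -> (k3, v3).
    static Map<String, Long> reduce(Map<String, List<Long>> grouped) {
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            long sum = 0L;
            for (long v : e.getValue()) {
                sum += v;
            }
            result.put(e.getKey(), sum);
        }
        return result;
    }

    static Map<String, Long> run(List<String> lines) {
        return reduce(mapAndGroup(lines));
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hello you", "hello me", "hello she")));
        // Prints: {hello=3, me=1, she=1, you=1}
    }
}
```

The result matches the reducer log above: hello is counted 3 times, and me, she and you once each, with keys arriving at the reducer in sorted order.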

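The MapReduce execution process
Map
1. The data in the blocks is cut into splits (InputSplit), and the data in each split is read by a RecordReader, which extracts each line as a record and marks it with a <k1, v1> pair,
where k1 is the offset of the line within the file and v1 is the content of the line.
2. The framework then calls the map method of the Mapper class (which we have overridden). We only need to take the value of v1 and apply our business logic to it. Using hello you\nhello me as an example,
the line hello you needs to be turned into ==>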