MapReduce
- 1. easily writing applications [makes it easy to write applications]
- 2. process vast amounts of data [handles very large data sets]
  - 1. in-parallel on large clusters (thousands of nodes) [data is processed in parallel]
    many tasks run at once (think single core vs. multiple cores)
  - 2. commodity hardware [for awareness]
  - 3. in a reliable, fault-tolerant manner
    fault tolerance: failed tasks are retried automatically
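The retry mechanism is configurable per job; a minimal sketch using the standard Hadoop properties (each task is attempted up to 4 times by default):

// a minimal sketch: cap task retry attempts in the driver's Configuration
Configuration conf = new Configuration();
conf.setInt("mapreduce.map.maxattempts", 4);    // map task attempts (default 4)
conf.setInt("mapreduce.reduce.maxattempts", 4); // reduce task attempts (default 4)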
1. mapreduce
- 1. easily writing applications:
  - 1. the framework is mostly interfaces:
    business logic + MR API => a finished MR program => submitted to YARN, which runs our program
- 2. MR is suited to offline (batch) data processing [not suitable for real-time work]
- 3. use MapReduce to compute wordcount
Questions:
- 1. What is the Map phase?
- 2. What is the Reduce phase?
- 3. What determines the number of map tasks?
- 4. What determines the number of reduce tasks?
- 5. What is shuffle?
- 6. Is a reduce phase always required?
- 7. What is a partition, and why partition?
- 8. What is an input split?
- the three-stage pattern of big data processing
  - input
  - processing
  - output
4. the overall MapReduce flow
(input) <k1,v1> -> map -> <k2,v2> -> reduce -> <k3,v3> (output)
- 1. the whole flow is programmed against key/value pairs
- 2. every phase's output is a key/value pair
- 3. the key and value data types have to be serializable:
  - 1. value classes have to implement the Writable interface
    (serialization/deserialization exists because data is sent across the network during shuffle)
  - 2. key classes have to implement the WritableComparable interface:
    key: must support both serialization and sorting
    value: only needs serialization
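A minimal sketch of a custom key type; the class name PairKey and its fields are hypothetical, made up for illustration:

import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// hypothetical key class: keys need Writable (serialization) + Comparable (sorting)
public class PairKey implements WritableComparable<PairKey> {
    private String word;
    private int tag;

    // serialization: write the fields to the stream
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(tag);
    }

    // deserialization: read the fields back in the same order
    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        tag = in.readInt();
    }

    // sorting: the shuffle sorts keys with this ordering
    // (a real key should also override hashCode/equals for partitioning)
    @Override
    public int compareTo(PairKey other) {
        int c = word.compareTo(other.word);
        return c != 0 ? c : Integer.compare(tag, other.tag);
    }
}

A value-only type would implement just Writable, i.e. write/readFields without compareTo.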
- analyzing the wordcount case with MapReduce
  - wc.data
    a,a,a,b,b
    x,x,x,y
  - input
    the file's data is processed line by line
    a,a,a,b,b
    x,x,x,y
    k: the line's offset
    v: the line's content
  - map: a one-to-one mapping, y = f(x)
    - 1. split each line on the delimiter and assign each word the value 1
      x,x,x,y -> (x,1)
                 (x,1)
                 (x,1)
                 (y,1)
      k: word
      v: count
  - reduce: reduction over a collection
    "pull the same keys together", then "do something"
    - "pull together" (shuffle):
      (x,<1,1,1>)
      (y,<1>)
    - "do something": sum
      (x,3)
      (y,1)
- wc example
- MapReduce program template
  - 1. mapper -- the map-phase code
  - 2. reducer -- the reduce-phase code
  - 3. driver -- defines how the map/reduce job is launched
(input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (output)
- 1. mapper phase
  Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
  - KEYIN: key type of the input data
  - VALUEIN: value type of the input data
  - KEYOUT: key type of the output data
  - VALUEOUT: value type of the output data
  - setup: runs once at the start of each map task
  - cleanup: runs once at the end of each map task (see the sketch below)
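A minimal sketch of overriding setup and cleanup, assuming the imports used by the WordCount class below; the line-counting logic is hypothetical, for illustration only:

// hypothetical mapper showing the per-task lifecycle hooks
public static class SketchMapper extends Mapper<Object, Text, Text, IntWritable> {
    private long lines;

    @Override
    protected void setup(Context context) {
        // runs once, before the first map() call of this map task
        lines = 0;
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        lines++; // runs once per input record
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // runs once, after the last map() call of this map task
        context.write(new Text("lines"), new IntWritable((int) lines));
    }
}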
- 2. reducer phase
  Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
  - KEYIN: key type of the input data
  - VALUEIN: value type of the input data
  - KEYOUT: key type of the output data
  - VALUEOUT: value type of the output data
- the MapReduce dataflow:
  - input -> output
  - the map phase's output becomes the reduce phase's input
- WordCount implementation (mapper + reducer + driver):
package com.bigdata.mapreuce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.StringTokenizer;
/**
* @author sxwang
* 11 18 11:18
*/
public class WordCount {

    /**
     * map phase
     */
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            /**
             * value: x,x,x,y
             * words = line.split(",")
             * emit (x,1) for every word
             */
            // wc.data is comma-separated, so tokenize on ","
            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    /**
     * reduce phase
     */
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        /**
         * map output:
         *   (x,1) (x,1) (x,1) (y,1)
         * reduce:
         *   1. shuffle "pulls the same keys together":
         *      (x,<1,1,1>)
         *      (y,<1>)
         */
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            /**
             * "do something": sum
             *   (x,<1,1,1>) -> (x,3)
             */
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    /**
     * driver
     */
    public static void main(String[] args) throws Exception {
        String input = "data/wc.data";
        String output = "out";

        Configuration conf = new Configuration();
        // 1. set the job name
        Job job = Job.getInstance(conf, "word count");
        // 2. set the main class and the map/reduce classes
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        // 3. declare the output kv types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 4. set the input path and the output path
        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));
        // 5. submit the MR job to yarn and wait for completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
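Usage note: running the driver's main() (from the IDE, or packaged and submitted with hadoop jar) reads data/wc.data and writes the word counts to out/part-r-00000. The out directory must not exist before the run, or FileOutputFormat fails the job; the PhoneApp below works around this by deleting its output path first.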
Case study
Compute each phone number's total upstream traffic, total downstream traffic, and overall total.
Sample record:
1363157985066 13726238888 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 10000 20000 200
Fields: timestamp, phone number, protocol, ip, ad url, xx, xx, upstream traffic, downstream traffic, status

Approach, first as SQL:
select
  sum(up) as sum_up,
  sum(down) as sum_down,
  sum(up)+sum(down) as all
from xx
group by phone

Then as MR k/v (implemented in PhoneApp after the sample data below):
map:
  1. read each line and split it: phone, up, down
  k: phone, v: (up, down)
reduce:
  k: phone, values: <(up,down), (up,down), ...>
  aggregate: sum_up, sum_down, all
  output: k, (sum_up, sum_down, all)

- data in access.log:
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200
1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200
1363154400022 13926251106 5C-0E-8B-8B-B1-50:CMCC 120.197.40.4 4 0 240 0 200
1363157993044 18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99 iface.qiyi.com 视频网站 15 12 1527 2106 200
1363157995074 84138413 5C-0E-8B-8C-E8-20:7DaysInn 120.197.40.4 122.72.52.12 20 16 4116 1432 200
1363157993055 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
1363157995033 15920133257 5C-0E-8B-C7-BA-20:CMCC 120.197.40.4 sug.so.360.cn 信息安全 20 20 3156 2936 200
1363157983019 13719199419 68-A1-B7-03-07-B1:CMCC-EASY 120.196.100.82 4 0 240 0 200
1363157984041 13660577991 5C-0E-8B-92-5C-20:CMCC-EASY 120.197.40.4 s19.cnzz.com 站点统计 24 9 6960 690 200
1363157973098 15013685858 5C-0E-8B-C7-F7-90:CMCC 120.197.40.4 rank.ie.sogou.com 搜索引擎 28 27 3659 3538 200
1363157986029 15989002119 E8-99-C4-4E-93-E0:CMCC-EASY 120.196.100.99 www.umeng.com 站点统计 3 3 1938 180 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157984040 13602846565 5C-0E-8B-8B-B6-00:CMCC 120.197.40.4 2052.flash2-http.qq.com 综合门户 15 12 1938 2910 200
1363157995093 13922314466 00-FD-07-A2-EC-BA:CMCC 120.196.100.82 img.qfc.cn 12 12 3008 3720 200
1363157982040 13502468823 5C-0A-5B-6A-0B-D4:CMCC-EASY 120.196.100.99 y0.ifengimg.com 综合门户 57 102 7335 110349 200
1363157986072 18320173382 84-25-DB-4F-10-1A:CMCC-EASY 120.196.100.99 input.shouji.sogou.com 搜索引擎 21 18 9531 2412 200
1363157990043 13925057413 00-1F-64-E1-E6-9A:CMCC 120.196.100.55 t3.baidu.com 搜索引擎 69 63 11058 48243 200
1363157988072 13760778710 00-FD-07-A4-7B-08:CMCC 120.196.100.82 2 2 120 120 200
1363157985066 13726238888 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157993055 13560436666 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
1363157985066 13726238888 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 10000 20000 200

package com.bigdata.mapreuce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class PhoneApp {

    /**
     * driver
     */
    public static void main(String[] args) throws Exception {
        String input = "data/access.log";
        String output = "out/phone1";

        Configuration conf = new Configuration();
        // 0. delete the output path if it already exists, otherwise the job fails
        FileSystem fs = FileSystem.get(conf);
        Path outputPath = new Path(output);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }
        // 1. set the job name
        Job job = Job.getInstance(conf, "PhoneAPP");
        // 2. set the main class and the map/reduce classes
        job.setJarByClass(PhoneApp.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        // 3. declare the output kv types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // 4. set the input path and the output path
        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));
        // 5. submit the MR job to yarn and wait for completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class MyMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // fields are tab-separated; up/down are the 3rd- and 2nd-to-last columns
            String[] split = value.toString().split("\t");
            String phone = split[1];
            String up = split[split.length - 3];
            String down = split[split.length - 2];
            context.write(new Text(phone), new Text(up + "\t" + down));
        }
    }

    public static class MyReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            long up_sum = 0;
            long down_sum = 0;
            // sum the (up, down) pairs pulled together for this phone number
            for (Text value : values) {
                String[] split = value.toString().split("\t");
                up_sum += Long.parseLong(split[0]);
                down_sum += Long.parseLong(split[1]);
            }
            long all = up_sum + down_sum;
            context.write(key, new Text(up_sum + "\t" + down_sum + "\t" + all));
        }
    }
}
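Design note: the value here is a tab-joined string, which is simple but stringly-typed. A common alternative, tying back to the serialization notes above, is a custom Writable value; the class name FlowBean below is hypothetical, not from these notes:

import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// hypothetical value bean: values need no sorting, so Writable alone suffices
public class FlowBean implements Writable {
    private long up;
    private long down;

    public FlowBean() { } // no-arg constructor required for deserialization

    public FlowBean(long up, long down) {
        this.up = up;
        this.down = down;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(up);
        out.writeLong(down);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        up = in.readLong();
        down = in.readLong();
    }

    public long getUp() { return up; }
    public long getDown() { return down; }
}

The mapper would then emit new FlowBean(up, down) (declared via job.setMapOutputValueClass(FlowBean.class)), and the reducer would sum getUp()/getDown() without any string parsing.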