MapReduce
- 1. easily writing applications [makes it easy to write applications]
- 2. process vast amounts of data [handles very large data sets]
  - 1. in-parallel on large clusters (thousands of nodes) [data is processed in parallel]
    many tasks run at once (think single core vs. multiple cores)
  - 2. commodity hardware [for awareness]
  - 3. in a reliable, fault-tolerant manner
    fault tolerance: failed tasks are retried automatically
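The retry mechanism is configurable per job; a minimal sketch using the standard Hadoop properties (each task is attempted up to 4 times by default):

// a minimal sketch: cap task retry attempts in the driver's Configuration
Configuration conf = new Configuration();
conf.setInt("mapreduce.map.maxattempts", 4);    // map task attempts (default 4)
conf.setInt("mapreduce.reduce.maxattempts", 4); // reduce task attempts (default 4)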
1. mapreduce
- 1. easily writing applications:
  - 1. the framework is mostly interfaces:
    business logic + MR API => a finished MR program => submitted to YARN, which runs our program
- 2. MR is suited to offline (batch) data processing [not suitable for real-time work]
- 3. use MapReduce to compute wordcount
Questions:
- 1. What is the Map phase?
- 2. What is the Reduce phase?
- 3. What determines the number of map tasks?
- 4. What determines the number of reduce tasks?
- 5. What is shuffle?
- 6. Is a reduce phase always required?
- 7. What is a partition, and why partition?
- 8. What is an input split?
- the three-stage pattern of big data processing
  - input
  - processing
  - output
4. the overall MapReduce flow
(input) <k1,v1> -> map -> <k2,v2> -> reduce -> <k3,v3> (output)
- 1. the whole flow is programmed against key/value pairs
- 2. every phase's output is a key/value pair
- 3. the key and value data types have to be serializable:
  - 1. value classes have to implement the Writable interface
    (serialization/deserialization exists because data is sent across the network during shuffle)
  - 2. key classes have to implement the WritableComparable interface:
    key: must support both serialization and sorting
    value: only needs serialization
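A minimal sketch of a custom key type; the class name PairKey and its fields are hypothetical, made up for illustration:

import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// hypothetical key class: keys need Writable (serialization) + Comparable (sorting)
public class PairKey implements WritableComparable<PairKey> {
    private String word;
    private int tag;

    // serialization: write the fields to the stream
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(tag);
    }

    // deserialization: read the fields back in the same order
    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        tag = in.readInt();
    }

    // sorting: the shuffle sorts keys with this ordering
    // (a real key should also override hashCode/equals for partitioning)
    @Override
    public int compareTo(PairKey other) {
        int c = word.compareTo(other.word);
        return c != 0 ? c : Integer.compare(tag, other.tag);
    }
}

A value-only type would implement just Writable, i.e. write/readFields without compareTo.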
- analyzing the wordcount case with MapReduce
  - wc.data
    a,a,a,b,b
    x,x,x,y
  - input
    the file's data is processed line by line
    a,a,a,b,b
    x,x,x,y
    k: the line's offset
    v: the line's content
  - map: a one-to-one mapping, y = f(x)
    - 1. split each line on the delimiter and assign each word the value 1
      x,x,x,y -> (x,1)
                 (x,1)
                 (x,1)
                 (y,1)
      k: word
      v: count
  - reduce: reduction over a collection
    "pull the same keys together", then "do something"
    - "pull together" (shuffle):
      (x,<1,1,1>)
      (y,<1>)
    - "do something": sum
      (x,3)
      (y,1)
- wc example
- MapReduce program template
  - 1. mapper -- the map-phase code
  - 2. reducer -- the reduce-phase code
  - 3. driver -- defines how the map/reduce job is launched
(input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (output)
- 1. mapper phase
  Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
  - KEYIN: key type of the input data
  - VALUEIN: value type of the input data
  - KEYOUT: key type of the output data
  - VALUEOUT: value type of the output data
  - setup: runs once at the start of each map task
  - cleanup: runs once at the end of each map task (see the sketch below)
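A minimal sketch of overriding setup and cleanup, assuming the imports used by the WordCount class below; the line-counting logic is hypothetical, for illustration only:

// hypothetical mapper showing the per-task lifecycle hooks
public static class SketchMapper extends Mapper<Object, Text, Text, IntWritable> {
    private long lines;

    @Override
    protected void setup(Context context) {
        // runs once, before the first map() call of this map task
        lines = 0;
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        lines++; // runs once per input record
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // runs once, after the last map() call of this map task
        context.write(new Text("lines"), new IntWritable((int) lines));
    }
}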
- 2. reducer phase
  Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
  - KEYIN: key type of the input data
  - VALUEIN: value type of the input data
  - KEYOUT: key type of the output data
  - VALUEOUT: value type of the output data
- the MapReduce dataflow:
  - input -> output
  - the map phase's output becomes the reduce phase's input
- WordCount implementation (mapper + reducer + driver):
package com.bigdata.mapreuce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.StringTokenizer;
/**
* @author sxwang
* 11 18 11:18
*/
public class WordCount {

    /**
     * map phase
     */
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            /**
             * value: x,x,x,y
             * words = line.split(",")
             * emit (x,1) for every word
             */
            // wc.data is comma-separated, so tokenize on ","
            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    /**
     * reduce phase
     */
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        /**
         * map output:
         *   (x,1) (x,1) (x,1) (y,1)
         * reduce:
         *   1. shuffle "pulls the same keys together":
         *      (x,<1,1,1>)
         *      (y,<1>)
         */
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            /**
             * "do something": sum
             *   (x,<1,1,1>) -> (x,3)
             */
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    /**
     * driver
     */
    public static void main(String[] args) throws Exception {
        String input = "data/wc.data";
        String output = "out";

        Configuration conf = new Configuration();
        // 1. set the job name
        Job job = Job.getInstance(conf, "word count");
        // 2. set the main class and the map/reduce classes
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        // 3. declare the output kv types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 4. set the input path and the output path
        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));
        // 5. submit the MR job to yarn and wait for completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
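Usage note: running the driver's main() (from the IDE, or packaged and submitted with hadoop jar) reads data/wc.data and writes the word counts to out/part-r-00000. The out directory must not exist before the run, or FileOutputFormat fails the job; the PhoneApp below works around this by deleting its output path first.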
Case study
Compute each phone number's total upstream traffic, total downstream traffic, and overall total.
Sample record:
1363157985066 13726238888 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 10000 20000 200
Fields: timestamp, phone number, protocol, ip, ad url, xx, xx, upstream traffic, downstream traffic, status

Approach, first as SQL:
select
  sum(up) as sum_up,
  sum(down) as sum_down,
  sum(up)+sum(down) as all
from xx
group by phone

Then as MR k/v (implemented in PhoneApp after the sample data below):
map:
  1. read each line and split it: phone, up, down
  k: phone, v: (up, down)
reduce:
  k: phone, values: <(up,down), (up,down), ...>
  aggregate: sum_up, sum_down, all
  output: k, (sum_up, sum_down, all)

- data in access.log:
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200
1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200
1363154400022 13926251106 5C-0E-8B-8B-B1-50:CMCC 120.197.40.4 4 0 240 0 200
1363157993044 18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99 iface.qiyi.com 视频网站 15 12 1527 2106 200
1363157995074 84138413 5C-0E-8B-8C-E8-20:7DaysInn 120.197.40.4 122.72.52.12 20 16 4116 1432 200
1363157993055 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
1363157995033 15920133257 5C-0E-8B-C7-BA-20:CMCC 120.197.40.4 sug.so.360.cn 信息安全 20 20 3156 2936 200
1363157983019 13719199419 68-A1-B7-03-07-B1:CMCC-EASY 120.196.100.82 4 0 240 0 200
1363157984041 13660577991 5C-0E-8B-92-5C-20:CMCC-EASY 120.197.40.4 s19.cnzz.com 站点统计 24 9 6960 690 200
1363157973098 15013685858 5C-0E-8B-C7-F7-90:CMCC 120.197.40.4 rank.ie.sogou.com 搜索引擎 28 27 3659 3538 200
1363157986029 15989002119 E8-99-C4-4E-93-E0:CMCC-EASY 120.196.100.99 www.umeng.com 站点统计 3 3 1938 180 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157984040 13602846565 5C-0E-8B-8B-B6-00:CMCC 120.197.40.4 2052.flash2-http.qq.com 综合门户 15 12 1938 2910 200
1363157995093 13922314466 00-FD-07-A2-EC-BA:CMCC 120.196.100.82 img.qfc.cn 12 12 3008 3720 200
1363157982040 13502468823 5C-0A-5B-6A-0B-D4:CMCC-EASY 120.196.100.99 y0.ifengimg.com 综合门户 57 102 7335 110349 200
1363157986072 18320173382 84-25-DB-4F-10-1A:CMCC-EASY 120.196.100.99 input.shouji.sogou.com 搜索引擎 21 18 9531 2412 200
1363157990043 13925057413 00-1F-64-E1-E6-9A:CMCC 120.196.100.55 t3.baidu.com 搜索引擎 69 63 11058 48243 200
1363157988072 13760778710 00-FD-07-A4-7B-08:CMCC 120.196.100.82 2 2 120 120 200
1363157985066 13726238888 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157993055 13560436666 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
1363157985066 13726238888 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 10000 20000 200

package com.bigdata.mapreuce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class PhoneApp {

    /**
     * driver
     */
    public static void main(String[] args) throws Exception {
        String input = "data/access.log";
        String output = "out/phone1";

        Configuration conf = new Configuration();
        // 0. delete the output path if it already exists, otherwise the job fails
        FileSystem fs = FileSystem.get(conf);
        Path outputPath = new Path(output);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }
        // 1. set the job name
        Job job = Job.getInstance(conf, "PhoneAPP");
        // 2. set the main class and the map/reduce classes
        job.setJarByClass(PhoneApp.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        // 3. declare the output kv types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // 4. set the input path and the output path
        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));
        // 5. submit the MR job to yarn and wait for completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class MyMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // fields are tab-separated; up/down are the 3rd- and 2nd-to-last columns
            String[] split = value.toString().split("\t");
            String phone = split[1];
            String up = split[split.length - 3];
            String down = split[split.length - 2];
            context.write(new Text(phone), new Text(up + "\t" + down));
        }
    }

    public static class MyReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            long up_sum = 0;
            long down_sum = 0;
            // sum the (up, down) pairs pulled together for this phone number
            for (Text value : values) {
                String[] split = value.toString().split("\t");
                up_sum += Long.parseLong(split[0]);
                down_sum += Long.parseLong(split[1]);
            }
            long all = up_sum + down_sum;
            context.write(key, new Text(up_sum + "\t" + down_sum + "\t" + all));
        }
    }
}
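Design note: the value here is a tab-joined string, which is simple but stringly-typed. A common alternative, tying back to the serialization notes above, is a custom Writable value; the class name FlowBean below is hypothetical, not from these notes:

import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// hypothetical value bean: values need no sorting, so Writable alone suffices
public class FlowBean implements Writable {
    private long up;
    private long down;

    public FlowBean() { } // no-arg constructor required for deserialization

    public FlowBean(long up, long down) {
        this.up = up;
        this.down = down;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(up);
        out.writeLong(down);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        up = in.readLong();
        down = in.readLong();
    }

    public long getUp() { return up; }
    public long getDown() { return down; }
}

The mapper would then emit new FlowBean(up, down) (declared via job.setMapOutputValueClass(FlowBean.class)), and the reducer would sum getUp()/getDown() without any string parsing.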