MapReduce Code Examples

Google's three papers --> Hadoop counterparts:
GFS --> HDFS
MapReduce --> Hadoop MapReduce
BigTable --> HBase

Hadoop
** common
** HDFS
** mapreduce
** YARN

MapReduce
** distributed offline (batch) computing model
** used to analyze historical data periodically (daily, weekly, monthly)
** a MapReduce job has two phases
** map phase: picks out the relevant data; multiple mappers are spawned
by default one split corresponds to one block, and each split is processed by one mapper
** reduce phase: merges the results produced by the map phase
For example, a 100 GB file (see the sketch after this list):
it is split into multiple blocks, scattered across different datanodes (usually a node is both a datanode and a nodemanager);
those nodemanagers launch one mapper per split

** Input --> map --> reduce --> output
** throughout the whole pipeline the data flows as key-value pairs
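
A rough back-of-the-envelope sketch of the 100 GB example above, assuming the Hadoop 2.x default block size of 128 MB (the block size is not stated in the original notes):

public class SplitCountEstimate {
    public static void main(String[] args) {
        long fileSize = 100L * 1024 * 1024 * 1024; // 100 GB
        long blockSize = 128L * 1024 * 1024;       // assumed HDFS block size: 128 MB
        long blocks = (fileSize + blockSize - 1) / blockSize; // round up
        // with the default of one split per block, this is also roughly the number of mappers
        System.out.println(blocks + " blocks -> about " + blocks + " mappers"); // 800
    }
}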
========wordcount================================================

Analyzing MapReduce through this example:

Requirement: word count; the field delimiter is \t
hadoop mapreduce
spark storm
map hadoop mapreduce
reduce storm hadoop
hbase map storm

** map input
** data is read line by line and converted to key-value pairs (the key is the byte offset of the line)
<0,hadoop mapreduce>
<17,spark storm>
<29,map hadoop mapreduce>
<50,reduce storm hadoop>
<70,hbase map storm>
** map output
<hadoop,1> <mapreduce,1> <spark,1> <storm,1> <map,1> <hadoop,1> ...
** intermediate results are stored temporarily in a local directory, not on HDFS
** reduce
** pulls the map output from the relevant nodemanagers
** runs the reduce function
** input
<hadoop,(1,1,1)> <storm,(1,1,1)> <hbase,(1)> ...
** output
hadoop 3
storm 3
hbase 1 ...
** the result is written to HDFS
hadoop 3
storm 3
hbase 1 ...

--------------------------------

Writing the MapReduce code for wordcount
** the standard three-part boilerplate
** mapper class --> mapper
** reducer class --> reducer
** Driver --> creates, configures, and runs the Job

1. (Optional)
Create a "Source Folder": the src/main/resources directory, used to hold core-site.xml and similar files
Copy log4j.properties:
$ cp /opt/modules/hadoop-2.5.0/etc/hadoop/log4j.properties /home/tom/workspace/myhdfs/src/main/resources

2. Write the code
Commonly used Hadoop types:
IntWritable LongWritable Text NullWritable
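
A minimal sketch (not part of the original notes) of how these Writable wrapper types relate to plain Java types:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        IntWritable count = new IntWritable(1);     // wraps a Java int
        LongWritable offset = new LongWritable(0L); // wraps a Java long
        Text word = new Text("hadoop");             // wraps a String (stored as UTF-8)
        NullWritable empty = NullWritable.get();    // singleton placeholder when a key or value is not needed
        System.out.println(word + " -> " + count.get() + ", offset " + offset.get() + ", " + empty);
    }
}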

package com.myblue.myhdfs;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMapReduce {

    // mapper
    public static class WordCountMapper extends
            Mapper<LongWritable, Text, Text, LongWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // the key is the byte offset of the current line (printed here only for illustration)
            System.out.println(key.get());
            String lineValue = value.toString();
            String[] splits = lineValue.split("\t");
            Text mapOutputKey = new Text(); // output key
            LongWritable mapOutputValue = new LongWritable(1); // output value, always 1 in this example
            for (String s : splits) {
                mapOutputKey.set(s);
                context.write(mapOutputKey, mapOutputValue);
            }
        }
    }

    // reducer
    public static class WordCountReducer extends
            Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                Context context) throws IOException, InterruptedException {

            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            LongWritable outputValue = new LongWritable();
            outputValue.set(sum);
            context.write(key, outputValue);
        }
    }

    public static void main(String[] args) throws Exception {
        // args = new String[]{"/input", "/output"};
        Configuration conf = new Configuration();

        // create the job
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCountMapReduce.class);

        // input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // output path: delete it first, because MapReduce refuses to write into an existing directory
        Path outPath = new Path(args[1]);
        FileSystem dfs = FileSystem.get(conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        FileOutputFormat.setOutputPath(job, outPath);

        // mapper
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // reducer
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

3. Run
Run it directly in Eclipse (core-site.xml must be on the classpath)


PS: packaging and running on the cluster
a) Start Hadoop (if it complains that a process is already running, delete the corresponding pid file under the tmp directory -- ls /tmp/*.pid)
b) Export the WordCountMapReduce class as a jar (a file name has to be supplied, e.g. XXX.jar)
c) /opt/modules/hadoop-2.5.0/bin/yarn jar WordCountMapReduce.jar com.myblue.myhdfs.WordCountMapReduce /input /output

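As an optional follow-up (not in the original notes), the job output can be read back from HDFS with the same FileSystem API the driver already uses; part-r-00000 is the conventional name of the first reducer's output file:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWordCountOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path result = new Path("/output/part-r-00000"); // output file of the first (and here only) reducer
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(result)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // e.g. "hadoop	3"
            }
        }
    }
}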

====Computing PV per province=====================================================

Data source:
** web server log file (20150828)

Case:
** log file produced by the web server (20150828)
** data dictionary
** 36 fields per record

Requirement:
** compute the PV for each province
** keyed by provinceId

Common traffic metrics:
** PV (page view)
one log record is written per page visit; repeated visits to the same page all count
** UV (unique visitor)
distinct visitors, identified by cookie
** unique IPs

Approach: aggregate on the provinceId field
** use provinceId as the key and 1 as the value

package com.myblue.myhdfs;

import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WebPvMapReduce extends Configured implements Tool {

    // mapper
    public static class ModuleMapper extends
            Mapper<LongWritable, Text, IntWritable, IntWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            String lineValue = value.toString();
            String[] splits = lineValue.split("\t");
            // filter out invalid records: a line with fewer than 30 fields is dropped
            if (splits.length < 30) {
                // arguments: counter group, counter name
                context.getCounter("Web Pv Counter", "Length limit 30").increment(1L);
                return;
            }

            String url = splits[1]; // the 2nd field is the url
            if (StringUtils.isBlank(url)) {
                context.getCounter("Web Pv Counter", "Url is Blank").increment(1L);
                return;
            }
            String provinceIdValue = splits[23]; // the 24th field is provinceId
            if (StringUtils.isBlank(provinceIdValue)) {
                context.getCounter("Web Pv Counter", "Province is Blank").increment(1L);
                return;
            }

            int provinceId = 0;
            try {
                provinceId = Integer.parseInt(provinceIdValue);
            } catch (Exception e) {
                System.out.println(e);
                return;
            }

            IntWritable mapOutputKey = new IntWritable();
            mapOutputKey.set(provinceId);
            IntWritable mapOutputValue = new IntWritable(1); // output value, always 1 in this example
            context.write(mapOutputKey, mapOutputValue);
        }
    }

    // reducer
    public static class ModuleReducer extends
            Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {

            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }

            IntWritable outputValue = new IntWritable();
            outputValue.set(sum);
            context.write(key, outputValue);
        }
    }

    @Override
    public int run(String[] args) throws Exception {

        // create the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(getClass());

        // input and output paths (the output path is deleted first if it already exists)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        Path outPath = new Path(args[1]);
        FileSystem dfs = FileSystem.get(conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        FileOutputFormat.setOutputPath(job, outPath);

        // mapper
        job.setMapperClass(ModuleMapper.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);

        // reducer
        job.setReducerClass(ModuleReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        // submit the job
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        args = new String[] { "/input2", "/output2" }; // hard-coded paths; overrides any command-line arguments
        // run the job through ToolRunner
        Configuration conf = new Configuration();
        int status = ToolRunner.run(conf, new WebPvMapReduce(), args);
        System.exit(status);
    }
}

Test:
$ hdfs dfs -mkdir /input2
$ hdfs dfs -put 2015082818 /input2
$ /opt/modules/hadoop-2.5.0/bin/yarn jar WebPvMapReduce.jar com.myblue.myhdfs.WebPvMapReduce /input2 /output2

====YARN================================================

YARN architecture
** cluster resource management plus job and task management
** before Hadoop 2.0
** jobtracker
** tasktracker
** since Hadoop 2.0
** resourcemanager
** nodemanager
** resourcemanager
--accepts client requests, e.g. bin/yarn jar xxx.jar wordcount /input /output
--launches and monitors the ApplicationMaster
--monitors the NodeManagers
--resource allocation and scheduling
** nodemanager
--resource management on a single node
--handles commands from the ResourceManager
--handles commands from the ApplicationMaster
** applicationmaster
--manages the current job; once the job finishes, the ApplicationMaster goes away
--splits the input data
--requests resources for the application and assigns them to its tasks
--task monitoring and fault tolerance
** Container
--an abstraction of the task's runtime environment; it bundles multi-dimensional resources such as CPU and memory together with environment variables, launch commands, and other information needed to run the task


How a MapReduce job runs on YARN
1. The client submits a job to the cluster and the ResourceManager receives the request
2. After receiving the request, the ResourceManager picks a NodeManager and launches an ApplicationMaster on it
3. The ApplicationMaster asks the ResourceManager for resources (which NodeManagers the job needs, and how much CPU/memory on each)
4. The ResourceManager responds to the ApplicationMaster with the corresponding resource information
5. With that information, the ApplicationMaster schedules and directs the other NodeManagers to run tasks
6. The relevant NodeManagers accept and run the tasks (map/reduce)
7. When a task finishes, the NodeManager reports back to the ApplicationMaster
8. The ApplicationMaster reports to the ResourceManager and returns the result to the client
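
A small optional sketch (an addition, not from the original notes) that uses the YARN client API to list the applications the ResourceManager knows about; each submitted MapReduce job shows up here as one application:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration(); // reads yarn-site.xml from the classpath
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        // ask the ResourceManager for the applications it is tracking
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + "\t"
                    + report.getName() + "\t"
                    + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}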

Hadoop revisited
** HDFS
--distributed file system architecture; stores the data
--namenode
--datanode
** YARN
--cluster resource management plus job and task management
--resourcemanager
--nodemanager

Typical cluster resource configuration (see the sketch below):
** memory
yarn.nodemanager.resource.memory-mb    e.g. 8 GB / 64 GB / 128 GB per node
** CPU
yarn.nodemanager.resource.cpu-vcores   e.g. 8 cores / 16 cores per node
** too little memory directly decides whether a job succeeds or fails
** too little CPU only affects how fast a job runs
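
A hedged sketch (not from the original notes) that reads these two settings from the configuration on the client's classpath; 8192 MB and 8 vcores are the Hadoop 2.x defaults, used here as fallbacks:

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ShowNodeManagerResources {
    public static void main(String[] args) {
        // picks up yarn-site.xml from the classpath, otherwise falls back to the defaults below
        YarnConfiguration conf = new YarnConfiguration();
        int memoryMb = conf.getInt("yarn.nodemanager.resource.memory-mb", 8192);
        int vcores = conf.getInt("yarn.nodemanager.resource.cpu-vcores", 8);
        System.out.println("yarn.nodemanager.resource.memory-mb = " + memoryMb);
        System.out.println("yarn.nodemanager.resource.cpu-vcores = " + vcores);
    }
}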