MapReduce Code Examples

Google's three papers --> Hadoop counterparts:
GFS --> HDFS
MapReduce --> Hadoop MapReduce
BigTable --> HBase

Hadoop
** common
** HDFS
** mapreduce
** YARN

MapReduce
** distributed offline (batch) computing model
** used to analyze historical data periodically (daily, weekly, monthly)
** a MapReduce job has two phases
** map phase: picks out the relevant data; multiple mappers are spawned
by default one split corresponds to one block, and each split is processed by one mapper
** reduce phase: merges the results produced by the map phase
For example, a 100 GB file (see the sketch after this list):
it is split into multiple blocks, scattered across different datanodes (usually a node is both a datanode and a nodemanager);
those nodemanagers launch one mapper per split

** Input --> map --> reduce --> output
** throughout the whole pipeline the data flows as key-value pairs
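
A rough back-of-the-envelope sketch of the 100 GB example above, assuming the Hadoop 2.x default block size of 128 MB (the block size is not stated in the original notes):

public class SplitCountEstimate {
    public static void main(String[] args) {
        long fileSize = 100L * 1024 * 1024 * 1024; // 100 GB
        long blockSize = 128L * 1024 * 1024;       // assumed HDFS block size: 128 MB
        long blocks = (fileSize + blockSize - 1) / blockSize; // round up
        // with the default of one split per block, this is also roughly the number of mappers
        System.out.println(blocks + " blocks -> about " + blocks + " mappers"); // 800
    }
}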
========wordcount================================================

Analyzing MapReduce through this example:

Requirement: word count; the field delimiter is \t
hadoop mapreduce
spark storm
map hadoop mapreduce
reduce storm hadoop
hbase map storm

** map input
** data is read line by line and converted to key-value pairs (the key is the byte offset of the line)
<0,hadoop mapreduce>
<17,spark storm>
<29,map hadoop mapreduce>
<50,reduce storm hadoop>
<70,hbase map storm>
** map output
<hadoop,1> <mapreduce,1> <spark,1> <storm,1> <map,1> <hadoop,1> ...
** intermediate results are stored temporarily in a local directory, not on HDFS
** reduce
** pulls the map output from the relevant nodemanagers
** runs the reduce function
** input
<hadoop,(1,1,1)> <storm,(1,1,1)> <hbase,(1)> ...
** output
hadoop 3
storm 3
hbase 1 ...
** the result is written to HDFS
hadoop 3
storm 3
hbase 1 ...

--------------------------------

Writing the MapReduce code for wordcount
** the standard three-part boilerplate
** mapper class --> mapper
** reducer class --> reducer
** Driver --> creates, configures, and runs the Job

1. (Optional)
Create a "Source Folder": the src/main/resources directory, used to hold core-site.xml and similar files
Copy log4j.properties:
$ cp /opt/modules/hadoop-2.5.0/etc/hadoop/log4j.properties /home/tom/workspace/myhdfs/src/main/resources

2. Write the code
Commonly used Hadoop types:
IntWritable LongWritable Text NullWritable
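
A minimal sketch (not part of the original notes) of how these Writable wrapper types relate to plain Java types:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        IntWritable count = new IntWritable(1);     // wraps a Java int
        LongWritable offset = new LongWritable(0L); // wraps a Java long
        Text word = new Text("hadoop");             // wraps a String (stored as UTF-8)
        NullWritable empty = NullWritable.get();    // singleton placeholder when a key or value is not needed
        System.out.println(word + " -> " + count.get() + ", offset " + offset.get() + ", " + empty);
    }
}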

package com.myblue.myhdfs;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMapReduce {

    // mapper
    public static class WordCountMapper extends
            Mapper<LongWritable, Text, Text, LongWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // the key is the byte offset of the current line (printed here only for illustration)
            System.out.println(key.get());
            String lineValue = value.toString();
            String[] splits = lineValue.split("\t");
            Text mapOutputKey = new Text(); // output key
            LongWritable mapOutputValue = new LongWritable(1); // output value, always 1 in this example
            for (String s : splits) {
                mapOutputKey.set(s);
                context.write(mapOutputKey, mapOutputValue);
            }
        }
    }

    // reducer
    public static class WordCountReducer extends
            Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                Context context) throws IOException, InterruptedException {

            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            LongWritable outputValue = new LongWritable();
            outputValue.set(sum);
            context.write(key, outputValue);
        }
    }

    public static void main(String[] args) throws Exception {
        // args = new String[]{"/input", "/output"};
        Configuration conf = new Configuration();

        // create the job
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCountMapReduce.class);

        // input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // output path: delete it first, because MapReduce refuses to write into an existing directory
        Path outPath = new Path(args[1]);
        FileSystem dfs = FileSystem.get(conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        FileOutputFormat.setOutputPath(job, outPath);

        // mapper
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // reducer
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

3. Run
Run it directly in Eclipse (core-site.xml must be on the classpath)


PS: packaging and running on the cluster
a) Start Hadoop (if it complains that a process is already running, delete the corresponding pid file under the tmp directory -- ls /tmp/*.pid)
b) Export the WordCountMapReduce class as a jar (a file name has to be supplied, e.g. XXX.jar)
c) /opt/modules/hadoop-2.5.0/bin/yarn jar WordCountMapReduce.jar com.myblue.myhdfs.WordCountMapReduce /input /output

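As an optional follow-up (not in the original notes), the job output can be read back from HDFS with the same FileSystem API the driver already uses; part-r-00000 is the conventional name of the first reducer's output file:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWordCountOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path result = new Path("/output/part-r-00000"); // output file of the first (and here only) reducer
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(result)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // e.g. "hadoop	3"
            }
        }
    }
}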

====Computing PV per province=====================================================

Data source:
** web server log file (20150828)

Case:
** log file produced by the web server (20150828)
** data dictionary
** 36 fields per record

Requirement:
** compute the PV for each province
** keyed by provinceId

Common traffic metrics:
** PV (page view)
one log record is written per page visit; repeated visits to the same page all count
** UV (unique visitor)
distinct visitors, identified by cookie
** unique IPs

Approach: aggregate on the provinceId field
** use provinceId as the key and 1 as the value

package com.myblue.myhdfs;

import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WebPvMapReduce extends Configured implements Tool {

    // mapper
    public static class ModuleMapper extends
            Mapper<LongWritable, Text, IntWritable, IntWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            String lineValue = value.toString();
            String[] splits = lineValue.split("\t");
            // filter out invalid records: a line with fewer than 30 fields is dropped
            if (splits.length < 30) {
                // arguments: counter group, counter name
                context.getCounter("Web Pv Counter", "Length limit 30").increment(1L);
                return;
            }

            String url = splits[1]; // the 2nd field is the url
            if (StringUtils.isBlank(url)) {
                context.getCounter("Web Pv Counter", "Url is Blank").increment(1L);
                return;
            }
            String provinceIdValue = splits[23]; // the 24th field is provinceId
            if (StringUtils.isBlank(provinceIdValue)) {
                context.getCounter("Web Pv Counter", "Province is Blank").increment(1L);
                return;
            }

            int provinceId = 0;
            try {
                provinceId = Integer.parseInt(provinceIdValue);
            } catch (Exception e) {
                System.out.println(e);
                return;
            }

            IntWritable mapOutputKey = new IntWritable();
            mapOutputKey.set(provinceId);
            IntWritable mapOutputValue = new IntWritable(1); // output value, always 1 in this example
            context.write(mapOutputKey, mapOutputValue);
        }
    }

    // reducer
    public static class ModuleReducer extends
            Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {

            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }

            IntWritable outputValue = new IntWritable();
            outputValue.set(sum);
            context.write(key, outputValue);
        }
    }

    @Override
    public int run(String[] args) throws Exception {

        // create the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(getClass());

        // input and output paths (the output path is deleted first if it already exists)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        Path outPath = new Path(args[1]);
        FileSystem dfs = FileSystem.get(conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        FileOutputFormat.setOutputPath(job, outPath);

        // mapper
        job.setMapperClass(ModuleMapper.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);

        // reducer
        job.setReducerClass(ModuleReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        // submit the job
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        args = new String[] { "/input2", "/output2" }; // hard-coded paths; overrides any command-line arguments
        // run the job through ToolRunner
        Configuration conf = new Configuration();
        int status = ToolRunner.run(conf, new WebPvMapReduce(), args);
        System.exit(status);
    }
}

Test:
$ hdfs dfs -mkdir /input2
$ hdfs dfs -put 2015082818 /input2
$ /opt/modules/hadoop-2.5.0/bin/yarn jar WebPvMapReduce.jar com.myblue.myhdfs.WebPvMapReduce /input2 /output2

====YARN================================================

YARN architecture
** cluster resource management plus job and task management
** before Hadoop 2.0
** jobtracker
** tasktracker
** since Hadoop 2.0
** resourcemanager
** nodemanager
** resourcemanager
--accepts client requests, e.g. bin/yarn jar xxx.jar wordcount /input /output
--launches and monitors the ApplicationMaster
--monitors the NodeManagers
--resource allocation and scheduling
** nodemanager
--resource management on a single node
--handles commands from the ResourceManager
--handles commands from the ApplicationMaster
** applicationmaster
--manages the current job; once the job finishes, the ApplicationMaster goes away
--splits the input data
--requests resources for the application and assigns them to its tasks
--task monitoring and fault tolerance
** Container
--an abstraction of the task's runtime environment; it bundles multi-dimensional resources such as CPU and memory together with environment variables, launch commands, and other information needed to run the task


How a MapReduce job runs on YARN
1. The client submits a job to the cluster and the ResourceManager receives the request
2. After receiving the request, the ResourceManager picks a NodeManager and launches an ApplicationMaster on it
3. The ApplicationMaster asks the ResourceManager for resources (which NodeManagers the job needs, and how much CPU/memory on each)
4. The ResourceManager responds to the ApplicationMaster with the corresponding resource information
5. With that information, the ApplicationMaster schedules and directs the other NodeManagers to run tasks
6. The relevant NodeManagers accept and run the tasks (map/reduce)
7. When a task finishes, the NodeManager reports back to the ApplicationMaster
8. The ApplicationMaster reports to the ResourceManager and returns the result to the client
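
A small optional sketch (an addition, not from the original notes) that uses the YARN client API to list the applications the ResourceManager knows about; each submitted MapReduce job shows up here as one application:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration(); // reads yarn-site.xml from the classpath
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        // ask the ResourceManager for the applications it is tracking
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + "\t"
                    + report.getName() + "\t"
                    + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}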

Hadoop revisited
** HDFS
--distributed file system architecture; stores the data
--namenode
--datanode
** YARN
--cluster resource management plus job and task management
--resourcemanager
--nodemanager

Typical cluster resource configuration (see the sketch below):
** memory
yarn.nodemanager.resource.memory-mb    e.g. 8 GB / 64 GB / 128 GB per node
** CPU
yarn.nodemanager.resource.cpu-vcores   e.g. 8 cores / 16 cores per node
** too little memory directly decides whether a job succeeds or fails
** too little CPU only affects how fast a job runs
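
A hedged sketch (not from the original notes) that reads these two settings from the configuration on the client's classpath; 8192 MB and 8 vcores are the Hadoop 2.x defaults, used here as fallbacks:

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ShowNodeManagerResources {
    public static void main(String[] args) {
        // picks up yarn-site.xml from the classpath, otherwise falls back to the defaults below
        YarnConfiguration conf = new YarnConfiguration();
        int memoryMb = conf.getInt("yarn.nodemanager.resource.memory-mb", 8192);
        int vcores = conf.getInt("yarn.nodemanager.resource.cpu-vcores", 8);
        System.out.println("yarn.nodemanager.resource.memory-mb = " + memoryMb);
        System.out.println("yarn.nodemanager.resource.cpu-vcores = " + vcores);
    }
}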