Five simple MapReduce programs
Introduction to MapReduce
What is MapReduce?
It is a programming model for distributed computation.
A MapReduce program is split into two phases: a map phase and a reduce phase.
The map phase is driven by a framework-provided program that users do not have to write themselves.
The reduce phase is likewise driven by a framework-provided program that users do not have to write themselves.
Users only develop the data-processing logic methods that these map and reduce programs call.
Map-phase logic method: xxxMapper.map()
Reduce-phase logic method: xxxReducer.reduce()
Map phase
How does the framework's map program call the user-written map() method?
For every line of input it reads, the map program calls map() once, passing the line's starting byte offset as the key and the line's content as the value, i.e. map(key, value, context).
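For example, if the first two lines of the input file are "hello world" and "hello hadoop", the framework makes two calls (offsets assume plain ASCII text with one-byte line endings):
map(0, "hello world", context)
map(12, "hello hadoop", context)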
Reduce phase
How does the framework's reduce program call the user-written reduce() method?
The reduce program receives the intermediate results output by the map programs, and all records with the same key arrive at the same reduce instance; for example, reduce instance 0 might receive data like this:
A:1 A:1 A:1 C:1 C:1 C:1 X:1 X:1 X:1
The reduce program sorts the data it receives into groups by key and calls reduce() once per group, passing the group to reduce(key, iterator of values, context).
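With the sample data above, reduce instance 0 therefore makes three calls, one per key:
reduce("A", [1, 1, 1], context) -> outputs A:3
reduce("C", [1, 1, 1], context) -> outputs C:3
reduce("X", [1, 1, 1], context) -> outputs X:3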
How MapReduce runs
How does a MapReduce program run?
A MapReduce program can run locally as a standalone program.
More typically, a MapReduce program is submitted to YARN and run as a distributed program. To do that:
Write a YARN client class (with a main method).
Specify the path of the job's jar.
Specify the mapper class and reducer class the job needs, and the key/value output types of the map and reduce phases.
Specify the directory of the data the job should process.
Specify the directory where the job should write its results.
Then call waitForCompletion() to submit the job to YARN's ResourceManager.
Launch command
Launching the client class:
You can launch it with java -cp, but then you have to put a large number of jars and configuration files on the classpath by hand, which is not recommended.
It is better to launch it with the hadoop jar pv.jar command; the hadoop command sets up the classpath automatically (it adds all jars and configuration files under the Hadoop installation directory to the classpath).
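For example, the word-count job from Program 2 below is started like this:
hadoop jar pv.jar cn.jixiang.mr.wc.WcJobSubmitter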
Preparing the environment
Preparing the MR runtime environment: configuring and starting the YARN cluster.
In the yarn-site.xml file under /usr/hadoop/hadoop-2.7.7/etc/hadoop, add the following configuration:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
When YARN starts, it also reads the machines listed in the slaves file; those machines become YARN's NodeManagers.
That way the NodeManagers run on exactly the same machines as the DataNodes.
mv mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
By default a MapReduce program runs in standalone (local) mode; setting mapreduce.framework.name to yarn hands it to YARN so that it runs distributed.
scp yarn-site.xml mapred-site.xml hadoop02:$PWD
……
Start the YARN cluster: start-yarn.sh
Check with jps.
Starting YARN does not depend on starting HDFS, but a MapReduce job will certainly need to read data from HDFS, so after starting YARN also start HDFS: start-dfs.sh
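Roughly what jps should show (the exact layout depends on your slaves file): the node configured as yarn.resourcemanager.hostname (hadoop01 here) runs the ResourceManager, and after start-dfs.sh also the NameNode; every node listed in slaves runs a NodeManager and a DataNode.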
Program 1
Count the number of visits per user (page views per IP address).
Map side:
Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
KEYIN is the starting offset of the line that the framework-provided program has read.
VALUEIN is the content of that line.
KEYOUT is the type of the key in the data that the user's logic method returns to the framework.
VALUEOUT is the type of the value in the data that the user's logic method returns to the framework.
Java types such as Long, String and Integer cannot be used directly in Hadoop: the data is shipped between machines over the network, so it is serialized and deserialized frequently, and Java's native serialization mechanism is very bloated, which is why Hadoop has its own serialization mechanism:
Long -> LongWritable
String -> Text        //make sure to import from the right package: org.apache.hadoop.io
Integer -> IntWritable
public class PvMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
/**
 * The framework-provided map program calls this map method once per input line
 */
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
//get one line of input
String line = value.toString();
String[] split = line.split(" ");
//cut out the ip address
String ip = split[0];
//emit the result through the context
context.write(new Text(ip), new IntWritable(1));
}
}
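For example, assuming an access-log line that starts with the client IP followed by a space (the exact log format is not given in these notes, and the sample line below is made up), an input line such as
194.237.142.21 /static/image/common/faq.gif HTTP/1.1 200
would make this mapper emit the pair ("194.237.142.21", 1).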
Reduce side:
KEYIN and VALUEIN correspond to the key and value types output by the map phase.
KEYOUT and VALUEOUT are the key and value types of the user's reduce-phase result.
public class PvReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
/**
 * The framework-provided reduce side calls this reduce method after it has collected a group of records with the same key
 */
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int count = 0;
for(IntWritable value:values){
count += value.get();
}
context.write(key, new IntWritable(count));
}
}
Job side:
The JobSubmitter class is really a YARN client. Its job is to submit our MapReduce jar to YARN, which then distributes the jar to many NodeManagers for execution.
public class JobSubmitter {
public static void main(String[] args) throws Exception {
//create a job that encapsulates the task's information
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJar("/root/pv.jar");
job.setMapperClass(PvMapper.class);
job.setReducerClass(PvReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//which input component to use to read the data; import from lib.input
job.setInputFormatClass(TextInputFormat.class);
//tell the component where to read from
FileInputFormat.setInputPaths(job, new Path(""));
job.setOutputFormatClass(TextOutputFormat.class);
//tell the component where to write the results
FileOutputFormat.setOutputPath(job, new Path(""));
//submit the job to yarn, which runs the tasks on the nodemanagers
//passing true prints the cluster-side progress information on the client
boolean res = job.waitForCompletion(true);
System.exit(res?0:1);
}
}
Program 2
Count word occurrences (word count).
Map side:
public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] words = line.split(" ");
for (String word : words) {
context.write(new Text(word), new IntWritable(1));
}
}
}
Reduce side:
public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
int count = 0;
for (IntWritable value : values) {
count += value.get();
}
context.write(key, new IntWritable(count));
}
}
Job side:
public class WcJobSubmitter {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJar("/root/pv.jar");
job.setMapperClass(WcMapper.class);
job.setReducerClass(WcReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.setInputPaths(job, new Path("/wc/input"));
job.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/wc/output"));
boolean res = job.waitForCompletion(true);
System.exit(res?0:1);
}
}
1. Package the project as a jar and upload it to a Hadoop host.
2. Create the input directory on HDFS: hadoop fs -mkdir -p /wc/input
3. Upload the file to be counted to that HDFS directory: hadoop fs -put qingshu.txt /wc/input
4. Run the job: hadoop jar pv.jar cn.jixiang.mr.wc.WcJobSubmitter
5. View the results: hadoop fs -ls /wc/output and hadoop fs -cat /wc/output/part-r-00000
Addendum:
//set the number of reduce tasks
job.setNumReduceTasks(4);
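With 4 reduce tasks the job writes 4 result files, part-r-00000 through part-r-00003, one per reducer; with the default of 1 reduce task everything ends up in a single part-r-00000, as in step 5 above.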
Program 3
(Sum the upstream and downstream traffic per phone number.)
Data in Hadoop is frequently serialized and deserialized,
so a custom type such as FlowBean must implement Hadoop's serialization interface (Writable).
Map side:
public class FlowSumMapper extends Mapper<LongWritable, Text, Text, FlowBean>{
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] split = line.split("\t");
String phone = split[1].trim();
long upflow = Long.parseLong(split[split.length-3]);
long downflow = Long.parseLong(split[split.length-2]);
FlowBean flowBean = new FlowBean(upflow, downflow);
context.write(new Text(phone), flowBean);
}
}
Reduce side:
public class FlowSumReducer extends Reducer<Text, FlowBean, Text, FlowBean>{
@Override
protected void reduce(Text key, Iterable<FlowBean> values, Reducer<Text, FlowBean, Text, FlowBean>.Context context)
throws IOException, InterruptedException {
long upflowSum = 0;
long downflowSum = 0;
for (FlowBean flowBean : values) {
upflowSum += flowBean.getUpflow();
downflowSum += flowBean.getDownflow();
}
context.write(key, new FlowBean(upflowSum, downflowSum));
}
}
Bean class:
public class FlowBean implements Writable{
private long upflow;
private long downflow;
private long sumflow;
//note: explicitly define a no-arg constructor (the serialization framework needs it to instantiate the bean when deserializing)
public FlowBean() {
super();
}
public FlowBean(long upflow, long downflow) {
super();
this.upflow = upflow;
this.downflow = downflow;
this.sumflow = upflow+downflow;
}
public long getUpflow() {
return upflow;
}
public void setUpflow(long upflow) {
this.upflow = upflow;
}
public long getDownflow() {
return downflow;
}
public void setDownflow(long downflow) {
this.downflow = downflow;
}
public long getSumflow() {
return sumflow;
}
public void setSumflow(long sumflow) {
this.sumflow = sumflow;
}
/**
 * Method called by the hadoop serialization framework when deserializing
 * @throws IOException
 */
public void readFields(DataInput in) throws IOException {
this.upflow = in.readLong();
this.downflow = in.readLong();
this.sumflow = in.readLong();
}
/**
 * Method called by the hadoop serialization framework when serializing
 * @throws IOException
 */
public void write(DataOutput out) throws IOException {
out.writeLong(upflow);
out.writeLong(downflow);
out.writeLong(sumflow);
}
@Override
public String toString() {
return upflow+"\t"+downflow+"\t"+sumflow;
}
}
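Note that write() and readFields() must handle the fields in exactly the same order (here upflow, downflow, sumflow); otherwise the values get scrambled when the bean is deserialized on the reduce side.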
Job side:
public class FlowSumJobSubmit {
public static void main(String[] args) throws Exception {
//create a job that encapsulates the task's information
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
//job.setJar("/root/pv.jar");
job.setJarByClass(FlowSumJobSubmit.class);
job.setMapperClass(FlowSumMapper.class);
job.setReducerClass(FlowSumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FlowBean.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
//which input component to use to read the data; import from lib.input
job.setInputFormatClass(TextInputFormat.class);
//tell the component where to read from
FileInputFormat.setInputPaths(job, new Path(args[0]));
job.setOutputFormatClass(TextOutputFormat.class);
//tell the component where to write the results
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//submit the job to yarn, which runs the tasks on the nodemanagers
//passing true prints the cluster-side progress information on the client
boolean res = job.waitForCompletion(true);
System.exit(res?0:1);
}
}
Program 4
(An improvement on program 3: per-province traffic statistics.)
Records for the same province must go to the same reducer. By default records are distributed by key.hashCode() % numReduceTasks, so we replace that rule with our own Partitioner.
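For reference, the rule being replaced is the framework's default HashPartitioner, which (roughly, imports omitted as elsewhere in these notes) behaves like this:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
public int getPartition(K key, V value, int numReduceTasks) {
//mask off the sign bit so the result is never negative, then spread keys over the reducers
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
}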
public class ProvincePartitioner extends Partitioner<Text, FlowBean>{
private static HashMap<String,Integer> provinceCode = new HashMap<>();
static{
provinceCode.put("135", 0);
provinceCode.put("136", 1);
provinceCode.put("137", 2);
provinceCode.put("138", 3);
}
@Override
public int getPartition(Text key, FlowBean value, int numPartitions) {
Integer code = provinceCode.get(key.toString().substring(0,3));
return code==null?4:code;
}
}
Rewrite the job class:
public class JobSubmitter {
public static void main(String[] args) throws Exception {
//create a job that encapsulates the task's information
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJar("/root/pv.jar");
job.setMapperClass(FlowSumMapper.class);
job.setReducerClass(FlowSumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FlowBean.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
//which input component to use to read the data; import from lib.input
job.setInputFormatClass(TextInputFormat.class);
//tell the component where to read from
FileInputFormat.setInputPaths(job, new Path(args[0]));
job.setOutputFormatClass(TextOutputFormat.class);
//tell the component where to write the results
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setPartitionerClass(ProvincePartitioner.class);
job.setNumReduceTasks(Integer.parseInt(args[2]));
//submit the job to yarn, which runs the tasks on the nodemanagers
//passing true prints the cluster-side progress information on the client
boolean res = job.waitForCompletion(true);
System.exit(res?0:1);
}
}
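Note: since ProvincePartitioner returns partition numbers 0 to 4, this job should be started with 5 reduce tasks (args[2] = 5). With fewer tasks (other than 1, where the custom partitioner is bypassed) the job fails with an illegal-partition error; with more, the extra reducers simply produce empty output files.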
Program 5
(For every word, list the files it appears in and its count in each file, i.e. an inverted index.)
Step one uses "word-filename" as the key and the word's count as the value;
step two uses the word as the key and "filename-count" as the value.
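For example, if the word hello appears 3 times in a.txt and 2 times in b.txt (file names are just illustrative), step one outputs tab-separated lines like
hello-a.txt	3
hello-b.txt	2
and step two merges them into one line per word:
hello	a.txt-->3 b.txt-->2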
Step one:
public class IndexStepOne {
public static class IndexStepOneMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
String fileName;
/**
 * When a running map task instance starts using our custom mapper class, it first calls setup(), and only calls it once
 */
@Override
protected void setup(Context context)
throws IOException, InterruptedException {
//we want to produce: key: word-filename  value: 1
FileSplit inputSplit = (FileSplit) context.getInputSplit();
fileName = inputSplit.getPath().getName();
}
@Override
protected void map(LongWritable key, Text value,Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] words = line.split(" ");
for (String word : words) {
context.write(new Text(word+"-"+fileName), new IntWritable(1));
}
}
/**
 * After a map task instance has finished processing the whole input split it is responsible for, it calls cleanup() once.
 */
@Override
protected void cleanup(Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
}
}
public static class IndexStepOneReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
int count = 0;
for (IntWritable value : values) {
count += value.get();
}
context.write(key, new IntWritable(count));
}
}
public static void main(String[] args) throws Exception {
//create a job that encapsulates the task's information
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(IndexStepOne.class);
job.setMapperClass(IndexStepOneMapper.class);
job.setReducerClass(IndexStepOneReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//which input component to use to read the data; import from lib.input
job.setInputFormatClass(TextInputFormat.class);
//tell the component where to read from
FileInputFormat.setInputPaths(job, new Path(args[0]));
job.setOutputFormatClass(TextOutputFormat.class);
//tell the component where to write the results
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//override the partitioning rule (not used here)
//job.setPartitionerClass(ProvincePartitioner.class);
job.setNumReduceTasks(Integer.parseInt(args[2]));
//submit the job to yarn, which runs the tasks on the nodemanagers
//passing true prints the cluster-side progress information on the client
boolean res = job.waitForCompletion(true);
System.exit(res?0:1);
}
}
Step two:
public class IndexStepSecond {
public static class IndexStepSecondMapper extends Mapper<LongWritable, Text, Text, Text>{
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] split = line.split("-");
String word = split[0];
String[] temp = split[1].split("\t");
String fileName = temp[0];
String count = temp[1];
context.write(new Text(word), new Text(fileName+"-->"+count));
}
}
public static class IndexStepSecondReducer extends Reducer<Text, Text, Text, Text>{
@Override
protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
StringBuilder sb = new StringBuilder();
for (Text value : values) {
sb.append(value.toString()).append(" ");
}
context.write(key, new Text(sb.toString()));
}
}
public static void main(String[] args) throws Exception {
//create a job that encapsulates the task's information
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(IndexStepSecond.class);
job.setMapperClass(IndexStepSecondMapper.class);
job.setReducerClass(IndexStepSecondReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
//if the map-phase output kv types are exactly the same as the reduce-phase output kv types, the two lines above can be omitted
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
//which input component to use to read the data; import from lib.input
//job.setInputFormatClass(TextInputFormat.class);
//tell the component where to read from
FileInputFormat.setInputPaths(job, new Path(args[0]));
//job.setOutputFormatClass(TextOutputFormat.class);
//tell the component where to write the results
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//override the partitioning rule (not used here)
//job.setPartitionerClass(ProvincePartitioner.class);
job.setNumReduceTasks(Integer.parseInt(args[2]));
//submit the job to yarn, which runs the tasks on the nodemanagers
//passing true prints the cluster-side progress information on the client
boolean res = job.waitForCompletion(true);
System.exit(res?0:1);
}
}