MapReduce

20190526

Category: Big Data

Article link: https://blog.csdn.net/qq_35211324/article/details/111865355

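MapReduce is a computing framework that runs on Hadoop. A computation is split into a Map phase and a Reduce phase, and for a simple distributed program the user only has to write the map() and reduce() functions.
The map function takes key/value pairs as input and emits a series of key/value pairs as intermediate output, which is written to local disk. MapReduce automatically groups the intermediate data by key and hands all values that share a key to the same reduce call.
The reduce function receives a key together with all the values collected for it, aggregates them, and writes the result to HDFS.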
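The flow described above (map, group by key, reduce) can be imitated in plain Java without a cluster. The following is only a minimal sketch for intuition: the class `LocalMapReduceSketch` and its methods are hypothetical names, everything runs in memory in a single thread, and it is not how Hadoop implements the shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of map -> group by key -> reduce.
class LocalMapReduceSketch {

    // "map": turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // "shuffle": collect all values that share a key, sorted by key.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // "reduce": collapse the values of one key into a single number.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three steps in order over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groupByKey(intermediate).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```

In a real job the map calls run in parallel on different input splits and the grouping step moves data across the network, but the contract seen by map() and reduce() is the same as in this sketch.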

The five programmable components of MapReduce

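InputFormat: the InputFormat class defines how input files are split and read. Hadoop ships with several input formats; among them is the abstract class FileInputFormat, from which every file-based InputFormat inherits its functionality and properties.
Mapper: converts the input data into the desired key/value pairs.
Partitioner: decides which Reducer receives each key/value pair produced by the Mapper.
Reducer: aggregates the values that share the same key.
OutputFormat: specifies the output format.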
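To make the Partitioner's role concrete, here is a small Hadoop-free sketch of the rule used by Hadoop's default HashPartitioner: mask off the sign bit of the key's hash code and take the remainder by the number of reduce tasks. The class name `HashPartitionSketch` is hypothetical, and the keys are plain Java strings, so the actual partition numbers will differ from a real job where the keys are `Text` objects with their own `hashCode()`.

```java
// Hypothetical, Hadoop-free illustration of hash partitioning.
class HashPartitionSketch {

    // Same arithmetic as the default hash partitioner:
    // clear the sign bit so the result is never negative,
    // then map the hash onto the range [0, numReduceTasks).
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : new String[] {"a", "b", "c", "a"}) {
            // Equal keys always land on the same reducer index.
            System.out.println(key + " -> reducer " + partitionFor(key, reducers));
        }
    }
}
```

Because the partition depends only on the key, every value for a given key reaches the same Reducer, which is what allows reduce() to see all values for that key in one call. With a single reduce task, every key maps to partition 0.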

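Now for the code: counting how many times "悟空" (Wukong) appears in Journey to the West.
Create a Java Maven project and add the following dependency.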

	<dependency>
	    <groupId>org.apache.hadoop</groupId>
	    <artifactId>hadoop-client</artifactId>
	    <version>3.2.1</version>
	</dependency>

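The code consists of three classes: App, the Mapper, and the Reducer. App configures the MapReduce job, the Mapper does the mapping, and the Reducer does the aggregation.
App class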

package com.yjb.hadoop;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Driver class: configures and submits the MapReduce job.
 */
public class App {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Cluster name, as defined in the Hadoop configuration files
        conf.set("fs.defaultFS", "hdfs://mycluster");
        FileSystem fs = FileSystem.get(conf);
        // Delete the output directory; otherwise every rerun fails because it already exists
        fs.delete(new Path("/output"), true);
        // Job name
        Job job = Job.getInstance(conf, "App");
        // Class used to locate the job jar
        job.setJarByClass(App.class);
        // Programmable component InputFormat: sets the input format, TextInputFormat by default
        // job.setInputFormatClass(cls);
        // Programmable component Mapper
        job.setMapperClass(WordCountMapper.class);
        // Programmable component Partitioner
        // job.setPartitionerClass(cls);
        // Programmable component Reducer
        job.setReducerClass(WordCountReduce.class);
        // Programmable component OutputFormat: sets the output format, TextOutputFormat by default
        // job.setOutputFormatClass(cls);
        // Number of reduce tasks
        job.setNumReduceTasks(1);
        // Output key type
        job.setOutputKeyClass(Text.class);
        // Output value type
        job.setOutputValueClass(Text.class);

        // Input and output directories on HDFS
        Path input = new Path("/input");
        Path output = new Path("/output");
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        // Exit with 0 on success and 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

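Note the five programmable components in the code above: Mapper and Reducer are set explicitly, while InputFormat, Partitioner, and OutputFormat are left commented out and fall back to their defaults.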

Mapper class

package com.yjb.hadoop;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, Text> {
    // Compiled once and reused for every input line
    private static final Pattern WUKONG = Pattern.compile("悟空");

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // By default each value is one line of input; count the "悟空" matches in that line
        Matcher matcher = WUKONG.matcher(value.toString());
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        // Emit the per-line count as text; the reducer parses and sums these values
        context.write(new Text("悟空"), new Text(String.valueOf(count)));
    }
}
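The counting loop inside map() can be tried out on its own, without Hadoop. The sketch below is a hypothetical helper (`OccurrenceCounter.countOccurrences` is not part of the original project) that applies the same `Matcher.find()` loop to one line. It additionally wraps the keyword in `Pattern.quote`, an assumption added here so that keywords containing regex metacharacters are matched literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone version of the per-line counting done in map().
class OccurrenceCounter {

    // Counts non-overlapping occurrences of keyword in line.
    static int countOccurrences(String line, String keyword) {
        // Pattern.quote makes the keyword a literal, so "." matches only a dot.
        Matcher matcher = Pattern.compile(Pattern.quote(keyword)).matcher(line);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // prints 2: find() resumes after each match, so matches do not overlap
        System.out.println(countOccurrences("aaaa", "aa"));
    }
}
```

Calling `countOccurrences(line, "悟空")` on a line of the novel returns the number the mapper would emit for that line. Inside a real mapper it is cheaper to compile the pattern once per task rather than once per call.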

Reducer class

package com.yjb.hadoop;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReduce extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Sum the per-line counts that the mapper emitted for this key
        int total = 0;
        for (Text v : values) {
            total += Integer.parseInt(v.toString());
        }
        // The output value reads "appeared: <total> times"
        context.write(key, new Text("出现了：" + total + "次"));
    }
}

Upload the Journey to the West text file to the /input/ directory on HDFS. The file must be UTF-8 encoded.
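The UTF-8 requirement comes from Hadoop's `Text` type, which stores and decodes text as UTF-8, so a file in another encoding would not match "悟空" in the mapper. If the source file uses a different encoding, it can be converted before uploading. The helper below is a minimal sketch; `Utf8Transcoder` and `toUtf8` are hypothetical names, and the caller has to know the source charset.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: rewrites a text file as UTF-8.
class Utf8Transcoder {

    // Reads source using sourceCharset and writes the same characters to target as UTF-8.
    // Files.newBufferedReader throws on malformed input, so a wrong charset fails loudly.
    static void toUtf8(Path source, Charset sourceCharset, Path target) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(source, sourceCharset);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}
```

For example, assuming the novel was saved as GBK (a guess; check the actual file), the call would be `Utf8Transcoder.toUtf8(Paths.get("in.txt"), Charset.forName("GBK"), Paths.get("out.txt"))`, where both file names are placeholders.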
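Export the program as a jar and run it on Hadoop. If the jar's manifest does not name a main class, append com.yjb.hadoop.App after the jar name.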

hadoop jar ABC.jar  # ABC.jar is the jar I exported

A small improvement on the code above: count the occurrences of 八戒, 大师兄, 猴哥, and 沙师弟 separately.

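package com.yjb.hadoop;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, Text> {
    // Names to count: 八戒, 大师兄, 猴哥, 沙师弟
    private static final String[] KEYWORDS = {"八戒", "大师兄", "猴哥", "沙师弟"};
    // One pattern per keyword, compiled once instead of once per input line
    private static final Pattern[] PATTERNS = new Pattern[KEYWORDS.length];
    static {
        for (int i = 0; i < KEYWORDS.length; i++) {
            PATTERNS[i] = Pattern.compile(KEYWORDS[i]);
        }
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String s = value.toString();
        for (int i = 0; i < KEYWORDS.length; i++) {
            // Count the matches of this keyword in the current line
            Matcher matcher = PATTERNS[i].matcher(s);
            int count = 0;
            while (matcher.find()) {
                count++;
            }
            // One (keyword, count) pair per keyword per line; the reducer sums them by keyword
            context.write(new Text(KEYWORDS[i]), new Text(String.valueOf(count)));
        }
    }
}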