MapReduce

小布爱篮球

Category: Big Data Basics. Tags: MapReduce (Part 1)

Article link: https://blog.csdn.net/weixin_43854923/article/details/85124755

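I. Overview:
MapReduce is the distributed computing framework in Hadoop. As the name suggests, a computation is divided into two main phases: the Map phase and the Reduce phase.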

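Map task:
1. Read the input file and parse it into key/value pairs. With the default text input format, each line becomes one pair (key: the byte offset of the line, value: the content of the line), and the map function is called once for every pair.
2. Apply your own logic to the input key and value and emit new key/value pairs.
3. Partition the emitted key/value pairs.
4. Within each partition, sort the data by key (dictionary order by default) and group it: the values that share a key are collected together.
5. (Optional) Combine the grouped data, that is, reduce it locally before it leaves the map side.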

Note: in MapReduce a Mapper can exist on its own (a map-only job), but a Reducer cannot, because its input is the output of the map tasks.

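Reduce task:
1. Copy the output of the map tasks over the network, partition by partition, to the nodes that run the reduce tasks. The map tasks do not push this data; each reduce task actively fetches it. The number of reduce tasks must be at least the number of partitions, so every partition number the partitioner returns must be smaller than the number of reduce tasks.
2. Merge and sort the output fetched from the map tasks. Then apply the logic of your reduce function to each key and its values and emit new key/value pairs.
3. Save the output of reduce to a file.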
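
The map and reduce steps above can be mimicked in a few lines of plain Java. This is only an in-memory sketch for building intuition: the class name `MiniMapReduce` and its methods are hypothetical and do not belong to the Hadoop API, and there is no network copy or spilling to disk.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of map -> partition -> sort/group -> reduce (not the Hadoop API).
final class MiniMapReduce {

    // Map step: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1L));
            }
        }
        return out;
    }

    // Partition step: pick the reducer index for a key (hash partitioning).
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Reduce step: sum all values that were grouped under one key.
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs every step and returns one sorted (word -> count) map per reducer.
    static List<TreeMap<String, Long>> run(List<String> lines, int numReducers) {
        // One bucket per partition. The TreeMap keeps keys in dictionary order,
        // and each value list holds the grouped values of one key.
        List<TreeMap<String, List<Long>>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new TreeMap<>());
        }
        for (String line : lines) {
            for (Map.Entry<String, Long> kv : map(line)) {
                buckets.get(partition(kv.getKey(), numReducers))
                       .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        List<TreeMap<String, Long>> results = new ArrayList<>();
        for (TreeMap<String, List<Long>> bucket : buckets) {
            TreeMap<String, Long> result = new TreeMap<>();
            bucket.forEach((key, values) -> result.put(key, reduce(values)));
            results.add(result);
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "hello hadoop");
        // With a single reducer everything lands in partition 0.
        System.out.println(run(lines, 1)); // prints [{hadoop=1, hello=2, world=1}]
    }
}
```

With more than one reducer the same words are spread over several result maps, one per partition, which corresponds to one output file per reduce task.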

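MapReduce execution flow
1. Run job: the client submits an MR jar to the JobClient (submitted with: hadoop jar …).
   (1) Collect the job's environment information, such as the component classes and the input/output key/value types, and check that it is valid.
   (2) Check that the input and output paths are valid.
2. The JobClient communicates over RPC with the ResourceManager (the JobTracker in Hadoop 1.x, whose terminology the remaining steps use), which returns an HDFS location for storing the jar together with a jobId. The jobId is globally unique and identifies the job.
3. The client writes the jar to HDFS (path = the HDFS location + jobId).
4. The job is submitted. What is submitted is the job's description, not the jar: the jobId, the location of the jar, the configuration, and so on.
5. The JobTracker initializes the job.
6. The files to be processed are read from HDFS and the input splits are computed; each split corresponds to one MapperTask. Note: a split is an object that stores only the description of a piece of data, whereas a block is the file block (data block) that stores the actual file data.
7. The TaskTrackers pick up tasks (task descriptions) through the heartbeat mechanism. By default a split has the same size as a block, and this is also the usual case in practice. Tasks are handed out according to the data locality policy: a map task should preferably run on a node that holds the block for its split.
8. The TaskTracker downloads the required jar, configuration files, and so on. The idea behind this: move the computation (the logic), not the data.
9. The TaskTracker starts a Java child process that executes the actual task (a MapperTask or a ReducerTask).
10. The result is written to HDFS.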

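In general, the size described by a split equals the size of a block. By convention, the JobTracker is run on the NameNode machine and the TaskTrackers are run on the DataNode machines.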
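
Why a split normally matches a block can be seen from how the split size is derived. The sketch below reproduces, to the best of my knowledge, the sizing rule used by Hadoop's `FileInputFormat` (split size = max(minSize, min(maxSize, blockSize)), plus a 10% tolerance for the last split). The class `SplitSketch` and the 128 MB block size are assumptions for illustration, and the sketch ignores empty and non-splittable files.

```java
// Plain-Java sketch of input split sizing; SplitSketch is a hypothetical name.
final class SplitSketch {

    // Split size rule: the block size, clamped between minSize and maxSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Counts the splits of one file. A final remainder of up to 10% of the
    // split size is merged into the last split instead of forming its own.
    static int countSplits(long fileLength, long splitSize) {
        final double splitSlop = 1.1;
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > splitSlop) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb; // assumed HDFS block size
        // With the assumed defaults (minSize = 1, maxSize = Long.MAX_VALUE)
        // the split size is simply the block size.
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println("split size in MB: " + splitSize / mb);
        System.out.println("300 MB file: " + countSplits(300 * mb, splitSize) + " splits");
        System.out.println("130 MB file: " + countSplits(130 * mb, splitSize) + " splits");
    }
}
```

Raising the minimum size above the block size yields larger splits, and lowering the maximum below it yields smaller ones; only then do splits and blocks differ.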

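Three classes need to be created: a Mapper, a Reducer, and a Driver.
Example 1: count how many times each word occurs in a file
Mapper:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

	@Override
	public void map(LongWritable ikey, Text ivalue, Context context) throws IOException, InterruptedException {
		// Split the line on spaces and emit (word, 1) for every word
		String line = ivalue.toString();
		String[] arr = line.split(" ");
		for (String str : arr) {
			// Consecutive spaces produce empty tokens; skip them
			if (!str.isEmpty()) {
				context.write(new Text(str), new LongWritable(1));
			}
		}
	}
}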

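Reducer:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

	@Override
	public void reduce(Text _key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
		// Sum the counts emitted for this word
		long sum = 0;
		for (LongWritable val : values) {
			sum += val.get();
		}
		context.write(_key, new LongWritable(sum));
	}
}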

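Driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf, "JobName");
		job.setJarByClass(WordCountDriver.class);
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);

		// If the mapper's output types are the same as the reducer's output types,
		// setting the output types once is enough
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);

		FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.60.132:9000/mr/words.txt"));
		// The output directory must not exist yet, otherwise the job fails
		FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.60.132:9000/result2"));

		// Block until the job finishes and report success or failure via the exit code
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}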

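Serialization/deserialization mechanism
If objects of a custom class need to be transferred within Hadoop, the class must implement the Writable interface, which defines how the object is serialized and deserialized.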

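import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class Flow implements Writable {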
	private String phone;
	private String city;
	private String name;
	private int flow;

	public String getPhone() {
		return phone;
	}

	public void setPhone(String phone) {
		this.phone = phone;
	}

	public String getCity() {
		return city;
	}

	public void setCity(String city) {
		this.city = city;
	}

	public String getName() {
		return name;
	}

	public void setName(String name) {
		this.name = name;
	}

	public int getFlow() {
		return flow;
	}

	public void setFlow(int flow) {
		this.flow = flow;
	}

	// Deserialization
	@Override
	public void readFields(DataInput in) throws IOException {
		// Read the fields back one by one, in the same order in which they were serialized
		this.phone = in.readUTF();
		this.city = in.readUTF();
		this.name = in.readUTF();
		this.flow = in.readInt();
	}

	// Serialization
	@Override
	public void write(DataOutput out) throws IOException {
		// Write the fields out one by one, in a fixed order (none of them may be null)
		out.writeUTF(phone);
		out.writeUTF(city);
		out.writeUTF(name);
		out.writeInt(flow);
	}
}
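
The rule that fields must be read in the same order in which they were written can be tried out without a cluster. The following is a minimal sketch that uses only `java.io`; `FlowRoundTrip` and `FlowRecord` are hypothetical stand-ins for the `Flow` class above, without the Hadoop dependency, and the sample values are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Round-trips a record through a byte array, the way a Writable is serialized.
final class FlowRoundTrip {

    // Hypothetical stand-in for Flow: same fields, same write/readFields shape.
    static final class FlowRecord {
        String phone;
        String city;
        String name;
        int flow;

        void write(DataOutput out) throws IOException {
            out.writeUTF(phone);
            out.writeUTF(city);
            out.writeUTF(name);
            out.writeInt(flow);
        }

        void readFields(DataInput in) throws IOException {
            // Must mirror the order used in write()
            phone = in.readUTF();
            city = in.readUTF();
            name = in.readUTF();
            flow = in.readInt();
        }
    }

    // Serializes the record to bytes and reads it back into a fresh object.
    static FlowRecord roundTrip(FlowRecord original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.write(out);
            }
            FlowRecord copy = new FlowRecord();
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FlowRecord r = new FlowRecord();
        r.phone = "13800000000"; // made-up sample data
        r.city = "bj";
        r.name = "zs";
        r.flow = 1024;
        FlowRecord copy = roundTrip(r);
        System.out.println(copy.phone + " " + copy.city + " " + copy.name + " " + copy.flow);
    }
}
```

Swapping two reads of different types in `readFields` (for example reading the int before the strings) makes the copy come back corrupted or fail with an exception, which is why the order matters.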

Partitioning - Partitioner
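        Partitioning is an important step of the shuffle. It distributes the map output to the different reduce tasks according to a rule, so that the job produces one output result per partition.
        Partitioner is the base class of all partitioners; a custom partitioner must extend it as well. HashPartitioner is the default partitioner in MapReduce.
        It computes: which reducer = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
        Note: by default, the number of reduce tasks is 1.
The built-in partitioning rule often does not meet our needs. To achieve a specific distribution, we have to define our own partitioning rule.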
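
The default formula is easy to try in isolation. The sketch below applies it to ordinary Java objects; `HashPartitionSketch` is a hypothetical name, and because Hadoop's `Text` computes its hash differently from `java.lang.String`, the partition numbers printed here need not match what a real job would assign.

```java
// Plain-Java sketch of the hash partitioning formula (not the Hadoop class).
final class HashPartitionSketch {

    // Masking with Integer.MAX_VALUE clears the sign bit, so the result
    // is always in the range [0, numReduceTasks).
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"bj", "sh", "gz", "sz"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
        // Integer.valueOf(-7).hashCode() is -7. A plain "-7 % 3" would be -1,
        // which is not a valid partition number; the masked version is not negative.
        System.out.println("-7 -> partition " + partitionFor(Integer.valueOf(-7), 3));
    }
}
```

With a single reduce task every key maps to partition 0, which matches the default of one output file.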

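Example: partition by city and compute the traffic generated by each person in each city
Mapper:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowMapper extends Mapper<LongWritable, Text, Text, Flow> {

	@Override
	public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

		String line = value.toString();

		// Expected line layout: phone city name flow (separated by single spaces)
		String[] arr = line.split(" ");

		Flow f = new Flow();
		f.setPhone(arr[0]);
		f.setCity(arr[1]);
		f.setName(arr[2]);
		f.setFlow(Integer.parseInt(arr[3]));

		// The phone number identifies the person and becomes the output key
		context.write(new Text(f.getPhone()), f);
	}
}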

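Custom partitioner:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FlowPartitioner extends Partitioner<Text, Flow> {

	@Override
	public int getPartition(Text key, Flow value, int numPartitions) {
		// "bj" -> partition 0, "sh" -> partition 1, every other city -> partition 2
		String city = value.getCity();

		if ("bj".equals(city)) {
			return 0;
		} else if ("sh".equals(city)) {
			return 1;
		} else {
			return 2;
		}
	}
}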

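Reducer:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowReducer extends Reducer<Text, Flow, Text, IntWritable> {

	@Override
	public void reduce(Text key, Iterable<Flow> values, Context context) throws IOException, InterruptedException {

		// Sum the traffic of all records that share this phone number
		int sum = 0;

		for (Flow val : values) {
			sum += val.getFlow();
		}
		context.write(key, new IntWritable(sum));
	}
}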

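Driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowDriver {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf, "JobName");
		job.setJarByClass(FlowDriver.class);
		job.setMapperClass(FlowMapper.class);
		job.setReducerClass(FlowReducer.class);

		// The map output types differ from the final output types, so both are set
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Flow.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		// Use the custom partitioner
		job.setPartitionerClass(FlowPartitioner.class);
		// One reduce task for each partition number FlowPartitioner can return (0, 1, 2)
		job.setNumReduceTasks(3);

		FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.60.132:9000/mr/flow.txt"));
		FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.60.132:9000/fpresult"));

		// Block until the job finishes and report success or failure via the exit code
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}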

Combiner:
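
A combiner carries out the optional reduction from step 5 of the map task: it pre-aggregates the output of a single map task locally, so that fewer records have to be copied to the reduce tasks. Below is a minimal plain-Java sketch of the idea for a word count; `CombinerSketch` is a hypothetical name and the input is made up. In a real job a combiner is registered on the `Job` with `setCombinerClass`, and a summing reducer such as `WordCountReducer` can usually be reused for it, because addition gives the same total no matter how the values are grouped.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of local pre-aggregation on the map side (not the Hadoop API).
final class CombinerSketch {

    // Collapses the (word, 1) pairs of ONE map task into (word, localCount) pairs.
    static Map<String, Long> combine(List<String> mapOutputKeys) {
        Map<String, Long> local = new TreeMap<>();
        for (String key : mapOutputKeys) {
            local.merge(key, 1L, Long::sum);
        }
        return local;
    }

    public static void main(String[] args) {
        // Keys emitted by one hypothetical map task, each with an implicit value of 1
        List<String> mapOutput = Arrays.asList("a", "b", "a", "a");
        Map<String, Long> combined = combine(mapOutput);
        // prints: 4 records before combine, 2 after: {a=3, b=1}
        System.out.println(mapOutput.size() + " records before combine, "
                + combined.size() + " after: " + combined);
    }
}
```

The reduce task still adds up the partial counts coming from all map tasks; the combiner only shrinks what travels over the network, so the final result must not depend on whether it ran.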

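Sorting
To sort the results, the class must implement the WritableComparable<T> interface, and its objects must be used as the output key of the mapper, because MapReduce sorts by key.

Implementing the WritableComparable<T> interface: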

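import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class Profit implements WritableComparable<Profit> {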

	private String name;
	private int profit;

	public String getName() {
		return name;
	}

	public void setName(String name) {
		this.name = name;
	}

	public int getProfit() {
		return profit;
	}

	public void setProfit(int profit) {
		this.profit = profit;
	}

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeUTF(name);
		out.writeInt(profit);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		this.name = in.readUTF();
		this.profit = in.readInt();

	}

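	// The sort order of the results is defined in this method: ascending by profit.
	// Integer.compare avoids the overflow that "this.profit - o.profit" can cause.
	@Override
	public int compareTo(Profit o) {
		return Integer.compare(this.profit, o.profit);
	}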
}
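
A caveat about writing comparisons as a subtraction of two ints: the expression is compact, but it overflows when the operands are far apart, and the sort order then silently breaks. The small plain-Java sketch below (the class `CompareSketch` is hypothetical) contrasts it with `Integer.compare`.

```java
// Shows why subtraction is a risky way to compare two ints.
final class CompareSketch {

    // Compact, but overflows when a and b are far apart.
    static int bySubtraction(int a, int b) {
        return a - b;
    }

    // Overflow-safe comparison from the JDK.
    static int byCompare(int a, int b) {
        return Integer.compare(a, b);
    }

    public static void main(String[] args) {
        int a = Integer.MAX_VALUE;
        int b = -1;
        // a is clearly greater than b, so a correct comparison returns a positive number.
        // Integer.MAX_VALUE - (-1) wraps around to Integer.MIN_VALUE, which is negative.
        System.out.println("by subtraction: " + bySubtraction(a, b));
        System.out.println("by Integer.compare: " + byCompare(a, b));
    }
}
```

For a descending order the arguments are simply swapped, for example `Integer.compare(b, a)`.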

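Mapper: the object must be emitted as the output key

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMapper extends Mapper<LongWritable, Text, Profit, NullWritable> {

	@Override
	public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

		String line = value.toString();

		// Expected line layout: name<TAB>profit
		String[] arr = line.split("\t");

		Profit p = new Profit();
		p.setName(arr[0]);
		p.setProfit(Integer.parseInt(arr[1]));

		// All the data is in the key; NullWritable serves as an empty value
		context.write(p, NullWritable.get());
	}
}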

The Reducer and Driver are omitted…
