The MapReduce Computing Model: Study Notes

phac123

Category: Hadoop

Article link: https://blog.csdn.net/weixin_42596275/article/details/105864723

MapReduce Job

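Each MapReduce task is initialized as a Job.
Each Job has two phases, Map and Reduce, implemented by the map function and the reduce function respectively.

Key-value pairs are what passes between the two phases: map turns each input pair (K1, V1) into a list of (K2, V2) pairs, and reduce turns each key K2 with its grouped values into a list of (K3, V3) pairs.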
[Figure: MapReduce workflow]
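The workflow can be traced end to end in plain Java, without Hadoop. This is only a minimal sketch under stated assumptions: the class name MiniMapReduce and the sample input are hypothetical, a TreeMap stands in for the framework's shuffle and sort, and everything runs in a single process.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process illustration of map -> shuffle/sort -> reduce.
final class MiniMapReduce {

    // Map phase: one input pair (offset, line) -> a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    // Shuffle and sort: group values by key; the TreeMap keeps keys in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: one key plus all of its values -> the count for that key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in order over the given input lines.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            mapped.addAll(map(offset, line));
            offset += line.length() + 1; // next line starts after the newline
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(Arrays.asList("hello world", "hello hadoop")));
    }
}
```

In a real job the map and reduce calls run on different machines and the framework performs the grouping; the sketch only shows the shape of the data at each step.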

Mapper

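A mapper extends
org.apache.hadoop.mapreduce.Mapper

public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

Key classes must implement WritableComparable; value classes must implement Writable.

void setup(Context context)

Called once before the map task processes any input. Use it to open a database connection or preprocess data.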

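void cleanup(Context context)

Called once as the last step of the map task (this method was close() in the old org.apache.hadoop.mapred API). It handles all teardown work, such as closing database connections and files.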

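void map(KEYIN key, VALUEIN value, Context context)

Applies the map operation to one input pair (key1, value1).

void run(Context context)

Drives the whole task: by default it calls setup, then map once per input pair, then cleanup. Override it for more complex control, such as a multithreaded map.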
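The order in which these four methods are called can be shown with a small stand-in class. This is a sketch, not Hadoop's code: LifecycleMapper, LoggingMapper and LifecycleDemo are hypothetical names, the Context parameter is replaced by a plain list of records, and the run method only mirrors the control flow of the default Mapper.run (setup, a map call per record, cleanup in a finally block).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Hadoop's Mapper, reduced to its lifecycle hooks.
abstract class LifecycleMapper {
    final List<String> log = new ArrayList<>();

    // Called once before any record is processed.
    protected void setup() {
        log.add("setup");
    }

    // Called once per input record.
    protected abstract void map(long key, String value);

    // Called once after the last record, even if map threw an exception.
    protected void cleanup() {
        log.add("cleanup");
    }

    // Template method: subclasses normally override map and leave run alone.
    public void run(List<String> records) {
        setup();
        try {
            long key = 0;
            for (String record : records) {
                map(key++, record);
            }
        } finally {
            cleanup();
        }
    }
}

// Records every map call so the call order is visible.
class LoggingMapper extends LifecycleMapper {
    @Override
    protected void map(long key, String value) {
        log.add("map(" + key + ", " + value + ")");
    }
}

class LifecycleDemo {
    public static void main(String[] args) {
        LoggingMapper mapper = new LoggingMapper();
        mapper.run(Arrays.asList("a", "b"));
        // prints [setup, map(0, a), map(1, b), cleanup]
        System.out.println(mapper.log);
    }
}
```

Overriding run replaces this loop entirely, which is why it is the hook for custom control such as handing records to several threads.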

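The map method

To write a MapReduce program, you mainly need to override the default map method with your own implementation.
Each call to map processes a single key-value pair.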

Reducer

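When a Reducer task receives the output of the mappers, the framework merges the values that share a key (shuffle) and sorts the input by key (sort). It then calls the reduce function once per key, iterating over the values associated with that key, and produces a list of <K3, V3> pairs.
A reducer extends org.apache.hadoop.mapreduce.Reducer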

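For comparison, the corresponding method on the Mapper side is:

protected void map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context) throws IOException, InterruptedException

- Processes one given key-value pair (K1, V1) and produces a list of (K2, V2) key-value pairs.
- context.write(key, value) emits the map output.
- Context also lets the mapper record additional information, such as status and counters, which is used to report task progress.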
Example: WordCount.java

package ex6;

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

	public static class TokenizerMapper extends 
	Mapper<Object, Text, Text, IntWritable> {

		private final static IntWritable one = new IntWritable(1);
		private Text word = new Text();

		// Teardown hook: runs once after the last map() call; a place to close database connections, files, etc.
		@Override
		protected void cleanup(Mapper<Object, Text, Text, IntWritable>.Context context)
				throws IOException, InterruptedException {
			// TODO auto-generated method stub
			super.cleanup(context);
		}

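		// run() drives the task: setup, then map() for each record, then cleanup.
		// Override it for custom control such as multithreaded mapping.
		@Override
		public void run(Mapper<Object, Text, Text, IntWritable>.Context context)
				throws IOException, InterruptedException {
			// TODO auto-generated method stub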
			super.run(context);
		}

		// One-time initialization; called before the first map() call.
		@Override
		protected void setup(Mapper<Object, Text, Text, IntWritable>.Context context)
				throws IOException, InterruptedException {
			// TODO auto-generated method stub
			super.setup(context);
		}

		// Splits each input line into tokens and emits (token, 1) for every token.
		@Override
		public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
			StringTokenizer itr = new StringTokenizer(value.toString(), "\t\n\r\f,.:;?![]' ");
			while (itr.hasMoreTokens()) {
				word.set(itr.nextToken());
				context.write(word, one);
			}
		}
	}

	public static class IntSumReducer extends 
	Reducer<Text, IntWritable, Text, IntWritable> {
		private IntWritable result = new IntWritable();

		// Sums all counts received for one word and emits (word, total).
		@Override
		public void reduce(Text key, Iterable<IntWritable> values, Context context)
				throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable val : values) {
				sum += val.get();
			}
			result.set(sum);
			context.write(key, result);
		}
	}

	// Driver: configures and submits the job
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
		if (otherArgs.length < 2) {
			System.err.println("Usage: wordcount <in> [<in>...] <out>");
			System.exit(2);
		}
		// Job.getInstance replaces the deprecated Job(Configuration, String) constructor.
		Job job = Job.getInstance(conf, "word count");
		job.setJarByClass(WordCount.class);
		job.setMapperClass(TokenizerMapper.class);
		job.setReducerClass(IntSumReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		for (int i = 0; i < otherArgs.length - 1; ++i) {
			FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
		}
		FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}

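Predefined classes:
Mapper implementations predefined by Hadoop
Mapper<K,V> (IdentityMapper<K,V> in MR1)
The base Mapper<K,V,K,V> maps its input directly to its output.

InverseMapper<K,V>
Extends Mapper<K,V,V,K>; it inverts each key-value pair by swapping the key and the value.

RegexMapper<K>
Extends Mapper<K, Text, Text, LongWritable>; it emits a (match, 1) pair for every match of a regular expression.

TokenCounterMapper (TokenCountMapper<K> in MR1)
Extends Mapper<Object, Text, Text, IntWritable>; it tokenizes the input value and emits a (token, 1) pair for each token. The MR1 version implements Mapper<K, Text, Text, LongWritable>.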
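What the first three predefined mappers do to a single record can be sketched in plain Java. This is an illustration only: PredefinedMapperSketch and its method names are hypothetical, Map.Entry stands in for a key-value pair, and the real classes write to a Context instead of returning values.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical plain-Java illustration of the behavior of the predefined mappers.
final class PredefinedMapperSketch {

    // Identity (the base Mapper's default behavior): (k, v) -> (k, v).
    static <K, V> Map.Entry<K, V> identity(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // InverseMapper: (k, v) -> (v, k).
    static <K, V> Map.Entry<V, K> inverse(K key, V value) {
        return new SimpleEntry<>(value, key);
    }

    // RegexMapper: one (match, 1) pair per regex match in the value; the key is ignored.
    static <K> List<Map.Entry<String, Long>> regex(K key, String value, Pattern pattern) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        Matcher matcher = pattern.matcher(value);
        while (matcher.find()) {
            out.add(new SimpleEntry<>(matcher.group(), 1L));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "to be or not to be";
        System.out.println(identity(0L, line));
        System.out.println(inverse(0L, line));
        System.out.println(regex(0L, line, Pattern.compile("be")));
    }
}
```

The type change in inverse is the point to notice: a job that uses InverseMapper on (offset, line) input must declare Text as its output key class and LongWritable as its output value class.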

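Reducer implementations predefined by Hadoop:
Reducer (IdentityReducer<K, V> in MR1)
The base Reducer<K, V, K, V> maps its input directly to its output.

IntSumReducer, LongSumReducer
IntSumReducer extends Reducer<K, IntWritable, K, IntWritable> and computes the sum of all values for a given key.
LongSumReducer extends Reducer<K, LongWritable, K, LongWritable> and computes the sum of all values for a given key.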
Example: MRPreDefined.java

package ex6;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;


public class MRPreDefined {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated Job(Configuration, String) constructor.
        Job job = Job.getInstance(conf, "word count");
    	job.setJarByClass(MRPreDefined.class);


        FileInputFormat.setInputPaths(job, new Path("testdata/input3"));
        FileOutputFormat.setOutputPath(job, new Path("testdata/output3-3"));
        
        // test1: pass the input straight through (identity Mapper and Reducer)
//        job.setOutputKeyClass(LongWritable.class);	// output key type
//        job.setOutputValueClass(Text.class);			// output value type
//        job.setMapperClass(Mapper.class);				// predefined
//        job.setReducerClass(Reducer.class);			// predefined
//        
        
        // test2: swap key and value
//        job.setOutputKeyClass(Text.class);             // output key type
//        job.setOutputValueClass(LongWritable.class);   // output value type
//        job.setMapperClass(InverseMapper.class);    // predefined
//        job.setReducerClass(Reducer.class);    // predefined
        
        
        // test3: count tokens and sum the counts
        job.setOutputKeyClass(Text.class);             // output key type
        job.setOutputValueClass(IntWritable.class);   // output value type
        job.setMapperClass(TokenCounterMapper.class);    // predefined
        job.setReducerClass(IntSumReducer.class);    // predefined
        
        
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
    
}