Hadoop Day 03

好死不如赖活着呀

Tags: hadoop

Article link: https://blog.csdn.net/qq_28728655/article/details/115631401

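1. Introduction to MapReduce
The idea behind MapReduce shows up everywhere in everyday life, and most people have run into it in one form or another. Its core is "divide and conquer", which suits large, complex processing tasks (large-scale data processing).
Map handles the "divide" part: it breaks a complex task into a number of simple tasks that are processed in parallel. A task can only be split this way if the small tasks can be computed in parallel and have almost no dependencies on one another.
Reduce handles the "combine" part: it performs a global aggregation of the results of the map phase.
Together, these two phases are the MapReduce idea.
MapReduce runs on a YARN cluster, which consists of:

ResourceManager
NodeManager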

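1.1. MapReduce design ideas and framework structure
MapReduce is a programming framework for distributed computation. Its core function is to combine the business logic written by the user with its own default components into a complete distributed program that runs concurrently on a Hadoop cluster.
Since it is a computation framework, it takes the usual form: there is an input, MapReduce processes that input through its predefined computation model, and the result is an output.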

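The ideas behind Hadoop MapReduce:

Divide and conquer

For big data whose parts have no computational dependencies on one another, the most natural way to parallelize is divide and conquer. The first key question in parallel computing is how to partition the computation or the data so that the resulting subtasks or data blocks can be processed at the same time. Tasks that cannot be split, or data with mutual dependencies, cannot be processed in parallel.
A unified framework that hides system-level details
Without a unified framework that encapsulates the low-level details, programmers would have to handle data storage, partitioning, distribution, result collection, error recovery, and many other concerns themselves. MapReduce therefore provides a unified computation framework that hides most system-level processing details from the programmer.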

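The biggest strength of MapReduce is that its abstract model and computation framework separate what needs to be done from how it is done, giving programmers an abstract, high-level programming interface and framework. Programmers only have to think about the computation at the application level and write a small amount of code for that computation. The many system-level details of carrying out the parallel computation are hidden and left to the framework: from distributing and executing the code to automatically scheduling clusters that range from a single node to thousands of nodes.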
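Building the abstract model: Map and Reduce
MapReduce borrows from functional programming languages and uses two functions, Map and Reduce, to provide a high-level abstract model for parallel programming.
Map: applies the same processing to each element of a set of data;
Reduce: further consolidates the intermediate results produced by Map.
Map and Reduce give programmers a clear, abstract description of the operations. The data that MapReduce processes takes the form of key-value pairs.
MapReduce defines the following two abstract programming interfaces, Map and Reduce, which the user implements:
Map: (k1; v1) → [(k2; v2)]
Reduce: (k2; [v2]) → [(k3; v3)]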
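
The two signatures can be tried out without a cluster. What follows is a minimal single-process sketch in plain Java, not Hadoop code: the class name MapReduceModel and its methods are made up for illustration, and a TreeMap stands in for the sorting and grouping that the real framework does between the two phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process model of the two interfaces; not Hadoop API.
class MapReduceModel {

    // Map: (k1, v1) -> [(k2, v2)]. k1 is the line's offset, v1 the line itself.
    static List<Map.Entry<String, Long>> map(long offset, String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.split(",")) {
            out.add(Map.entry(word, 1L));
        }
        return out;
    }

    // Reduce: (k2, [v2]) -> (k3, v3). Sums the values collected for one key.
    static Map.Entry<String, Long> reduce(String word, List<Long> ones) {
        long sum = 0;
        for (long one : ones) {
            sum += one;
        }
        return Map.entry(word, sum);
    }

    // Runs map over every line, groups the intermediate pairs by key
    // (the framework's job in real MapReduce), then runs reduce per key.
    static Map<String, Long> run(List<String> lines) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Long> pair : map(offset, line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
            offset += line.length() + 1; // +1 for the newline, assuming ASCII text
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : grouped.entrySet()) {
            Map.Entry<String, Long> kv = reduce(group.getKey(), group.getValue());
            result.put(kv.getKey(), kv.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(List.of("hello,world", "hello,hadoop")));
    }
}
```

In real MapReduce the map calls run in parallel on different nodes and the grouping happens during the shuffle; the sketch only shows how the data shapes line up.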

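MapReduce framework structure
A complete MapReduce program running in distributed mode has three kinds of instance processes:

MRAppMaster: schedules the whole program and coordinates its state
MapTask: handles the entire data processing flow of the map phase
ReduceTask: handles the entire data processing flow of the reduce phase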

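2. MapReduce programming conventions
Developing a MapReduce program takes eight steps: 2 in the Map phase, 4 in the Shuffle phase, and 2 in the Reduce phase.
Map phase: 2 steps

1. Set the InputFormat class, which splits the data into key-value pairs (K1 and V1) and feeds them to step 2
2. Write the custom Map logic, which converts the result of step 1 into new key-value pairs (K2 and V2) and outputs them
Shuffle phase: 4 steps
3. Partition the output key-value pairs (useful when there is too much data for one reducer; for example, words of length 5 or more go to one partition and shorter words to another)
4. Sort the data in each partition by key
5. (Optional) Run a combiner for a preliminary reduction of the data, which cuts the amount of data copied over the network (the map side does part of the reduce work and eases the load on reduce)
6. Group the data (values with the same key are collected into one collection)
Reduce phase: 2 steps
7. Sort and merge the results of multiple Map tasks, then write the Reduce function with your own logic to process the input key-value pairs and emit new key-value pairs (K3 and V3)
8. Set the OutputFormat, which processes and saves the key-value data output by Reduce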
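
To make the Shuffle steps more concrete, here is a small in-memory sketch in plain Java. It is only a model under stated assumptions: the class ShuffleSketch and its methods are hypothetical, the partition rule is the word-length example, the optional combiner step is left out, and a TreeMap per partition does the sorting and grouping that Hadoop performs with spills and merges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the shuffle steps; not Hadoop API.
class ShuffleSketch {

    // Partitioning: choose a partition for a key (here: by word length).
    static int partition(String key) {
        return key.length() >= 5 ? 0 : 1;
    }

    // Partition the (k2, v2) pairs, then sort and group each partition by key.
    // A TreeMap keeps its keys sorted and collects the values per key.
    static List<Map<String, List<Long>>> shuffle(List<Map.Entry<String, Long>> pairs, int numPartitions) {
        List<Map<String, List<Long>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Long> pair : pairs) {
            partitions.get(partition(pair.getKey()))
                    .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> pairs = List.of(
                Map.entry("hello", 1L), Map.entry("world", 1L),
                Map.entry("hi", 1L), Map.entry("hello", 1L));
        // prints [{hello=[1, 1], world=[1]}, {hi=[1]}]
        System.out.println(shuffle(pairs, 2));
    }
}
```

Each map in the returned list corresponds to the input of one Reduce task: sorted keys, each with the collection of its values.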

<packaging>jar</packaging>
<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>
<dependencies>
  <dependency>
      <groupId>jdk.tools</groupId>
      <artifactId>jdk.tools</artifactId>
      <version>1.8</version>
      <scope>system</scope>
      <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
  </dependency>
  <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>3.0.0</version>
      <scope>provided</scope>
  </dependency>
  <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>3.0.0</version>
  </dependency>
  <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs-client</artifactId>
      <version>3.0.0</version>
      <scope>provided</scope>
  </dependency>
  <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>3.0.0</version>
  </dependency>
  <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
  </dependency>
</dependencies>

WordCount
Finally, K3 and V3 are written to the result file by TextOutputFormat.
Step 2. Mapper

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
Mapper type parameters:
  KEYIN:    type of k1, the byte offset of the line  -> LongWritable
  VALUEIN:  type of v1, one line of text             -> Text
  KEYOUT:   type of k2, a single word                -> Text
  VALUEOUT: type of v2, the fixed value 1            -> LongWritable
*/
// LongWritable wraps a long and Text wraps a byte array. Hadoop serializes objects constantly,
// and Java's own serialization of Long and String is too heavyweight, so Hadoop defines its own types for it.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
	/*
	map() converts (k1, v1) into (k2, v2).
	key:     k1
	value:   v1
	context: the MapReduce context object, the bridge to the rest of the framework

	k1     v1
	0      hello,world
	12     hello,hadoop
	---------------------
	k2       v2
	hello    1
	world    1
	hello    1
	hadoop   1
	*/
	@Override
	public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		// Split each line into words
		String line = value.toString();
		String[] split = line.split(",");
		// Emit (word, 1) for every word in the array
		for (String word : split) {
			context.write(new Text(word), new LongWritable(1));
		}
	}
}
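
The k1 values that TextInputFormat hands to the mapper are byte offsets of the start of each line. A small sketch of how they come about, assuming UTF-8 text with '\n' line endings (the helper class LineOffsets is hypothetical and only mimics the arithmetic, not Hadoop's record reader):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing how k1 (the line's starting byte offset) is derived.
class LineOffsets {

    // Returns the starting byte offset of each '\n'-terminated line in the text.
    static List<Long> offsets(String text) {
        List<Long> result = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            result.add(offset);
            // The next line starts after this line's bytes plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // "hello,world" is 11 bytes, plus 1 for the newline: prints [0, 12]
        System.out.println(offsets("hello,world\nhello,hadoop\n"));
    }
}
```

With Windows line endings ("\r\n") every line would take one more byte, so the second offset would be 13 instead.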

Step 3. Reducer

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/*
Reducer type parameters:
  KEYIN:    type of K2, a single word                              -> Text
  VALUEIN:  type of V2, the element type of the values collection  -> LongWritable
  KEYOUT:   type of K3, a single word                              -> Text
  VALUEOUT: type of V3, the number of occurrences of the word      -> LongWritable
*/
public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
	/*
	Custom reduce logic: reduce() converts (k2, [v2]) into (k3, v3).
	Every key is a word, and its values are the counts emitted for that word.
	key:     k2
	values:  the collection of v2 values for this key
	context: the MapReduce context object

	Example of grouped input (the new k2, v2) and the resulting output:

	k2           v2
	hello        <1,1>
	world        <1,1>
	hadoop       <1,1,1>
	----------------------------
	k3           v3
	hello        2
	world        2
	hadoop       3
	*/
	@Override
	protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
		long count = 0;
		// 1. Iterate over the values collection
		for (LongWritable value : values) {
			// 2. Add up the values
			count += value.get();
		}
		// 3. Write (k3, v3) to the context
		context.write(key, new LongWritable(count));
	}
}

// TextOutputFormat then writes the result to the output file automatically

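Step 4. Define the main class, describe the Job, and submit it

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class JobMain extends Configured implements Tool {
	@Override
	public int run(String[] args) throws Exception {
		// Create a job object.
		// super.getConf() returns the Configuration that main() passes to ToolRunner.run()
		Job job = Job.getInstance(super.getConf(), "mapreduce-wordcount");
		// Required when the job is packaged and run on a cluster: identifies the jar by the main class
		job.setJarByClass(JobMain.class);
		// Step 1: set the class that reads the input files and parses them into key-value pairs
		job.setInputFormatClass(TextInputFormat.class);
		TextInputFormat.addInputPath(job, new Path("hdfs://192.168.52.250:8020/wordcount"));

		// Step 2: set the mapper class
		job.setMapperClass(WordCountMapper.class);
		// Set the output types of the map phase
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(LongWritable.class);

		// Steps 3 to 6 (partition, sort, combine, group) are omitted; the defaults are used

		// Step 7: set the reducer class
		job.setReducerClass(WordCountReducer.class);
		// Set the output types of the reduce phase
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);

		// Step 8: set the output class and the output path
		job.setOutputFormatClass(TextOutputFormat.class);
		TextOutputFormat.setOutputPath(job, new Path("hdfs://192.168.52.250:8020/wordcount_out"));
		boolean b = job.waitForCompletion(true);
		// This value is what ToolRunner.run() returns in main() below
		return b ? 0 : 1;
	}

	/**
	 * Entry point of the program.
	 * @param args
	 * @throws Exception
	 */
	public static void main(String[] args) throws Exception {
		Configuration configuration = new Configuration();
		// Start the job
		Tool tool = new JobMain();
		// The return value reports how the job went: 0 means success, anything else means failure.
		// The args of main() can be passed straight through as the last argument.
		int run = ToolRunner.run(configuration, tool, args);
		System.exit(run);
	}
}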

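Packaging:

1. Click test, then click the button marked by the red box at the top (shown in the original screenshot) to skip the test phase. Then double-click package to start packaging.

2. The build succeeds.

3. The project now has a target directory; the .jar file in it is the packaged program.
4. Look up the fully qualified name of JobMain:

cn.itcast.mapreduce.JobMain
5. Run the code on the cluster:
hadoop jar xxxx.jar cn.itcast.mapreduce.JobMain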

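4. MapReduce run modes
Cluster mode

The MapReduce program is submitted to the YARN cluster and distributed to many nodes for concurrent execution
The input data and the output results should be on HDFS
Steps for submitting to the cluster: package the program as a JAR, then start it with the hadoop command on any node of the cluster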

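5. MapReduce partitioning

In MapReduce, specifying a partitioner ensures that all data in the same partition is sent to the same Reduce task for processing.
For example, to compute statistics, a batch of similar data can be sent to the same Reduce task and counted there. In other words, data of the same type, data with something in common, is sent to one place to be processed together. By default there is only one partition, because a job runs a single Reduce task unless configured otherwise.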

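Custom Partitioner
This is where the main logic lives, and it is the point of this example: the Partitioner distributes the data to different Reducers.

The output of each partition is written to a separate file.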

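import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * The type parameters here are the same as the output types of the map phase.
 */
public class PartitionerOwn extends Partitioner<Text, LongWritable> {
	/**
	 * The return value says which partition the record goes to.
	 * It is only a partition label: all records with the same label go to the same partition.
	 * text:         K2
	 * longWritable: V2
	 * i:            the number of reduce tasks
	 */
	@Override
	public int getPartition(Text text, LongWritable longWritable, int i) {
		// Words of length >= 5 go to the first partition, i.e. the first ReduceTask (number 0)
		if (text.toString().length() >= 5) {
			return 0;
		} else {
			// Words of length < 5 go to the second partition, i.e. the second ReduceTask (number 1)
			return 1;
		}
	}
}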
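
For comparison, when no Partitioner is set Hadoop falls back to HashPartitioner, which computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The sketch below puts that formula next to the word-length rule in plain Java. It is only an illustration: the class PartitionSketch is hypothetical, and it hashes a String, whereas the real job hashes a Text key, so the concrete partition numbers can differ from what a cluster would produce.

```java
// Hypothetical stand-alone model of two partitioning rules; not Hadoop API.
class PartitionSketch {

    // Same formula as Hadoop's default HashPartitioner: clear the sign bit so
    // the result is never negative, then take the remainder.
    static int hashPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Custom rule: words of length >= 5 go to partition 0, others to 1.
    static int lengthPartition(String key) {
        return key.length() >= 5 ? 0 : 1;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"hello", "world", "hadoop", "hi"}) {
            System.out.println(word + " -> hash: " + hashPartition(word, 2)
                    + ", length: " + lengthPartition(word));
        }
    }
}
```

Either way, equal keys always land in the same partition, which is what guarantees that one Reduce task sees all values for a key. The custom rule additionally assumes that the job runs with at least two Reduce tasks.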

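Step 4. Main entry point

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class JobMain extends Configured implements Tool {
	@Override
	public int run(String[] args) throws Exception {
		// Create a job object.
		// super.getConf() returns the Configuration that main() passes to ToolRunner.run()
		Job job = Job.getInstance(super.getConf(), "mapreduce-wordcount");
		// Required when the job is packaged and run on a cluster: identifies the jar by the main class
		job.setJarByClass(JobMain.class);
		// Step 1: set the class that reads the input files and parses them into key-value pairs
		job.setInputFormatClass(TextInputFormat.class);
		TextInputFormat.addInputPath(job, new Path("hdfs://192.168.52.250:8020/wordcount"));

		// Step 2: set the mapper class
		job.setMapperClass(WordCountMapper.class);
		// Set the output types of the map phase
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(LongWritable.class);

		// Steps 3 to 6 (partition, sort, combine, group): only the partitioner is customized
		// **** Set the partitioner class ****
		job.setPartitionerClass(PartitionerOwn.class);

		// Step 7: set the reducer class
		job.setReducerClass(WordCountReducer.class);
		// Set the output types of the reduce phase
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);

		// **** Set the number of Reduce tasks ****
		job.setNumReduceTasks(2);

		// Step 8: set the output class and the output path
		job.setOutputFormatClass(TextOutputFormat.class);
		TextOutputFormat.setOutputPath(job, new Path("hdfs://192.168.52.250:8020/wordcount_out"));
		boolean b = job.waitForCompletion(true);
		// This value is what ToolRunner.run() returns in main() below
		return b ? 0 : 1;
	}

	/**
	 * Entry point of the program.
	 * @param args
	 * @throws Exception
	 */
	public static void main(String[] args) throws Exception {
		Configuration configuration = new Configuration();
		// Start the job
		Tool tool = new JobMain();
		// The return value reports how the job went: 0 means success, anything else means failure.
		// The args of main() can be passed straight through as the last argument.
		int run = ToolRunner.run(configuration, tool, args);
		System.exit(run);
	}
}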

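Two output files appear, one per partition: words of length 5 or more are in the first and shorter words are in the second.
part-r-00000

part-r-00001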

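6. MapReduce serialization and sorting
Serialization is the conversion of a structured object into a byte stream, so that the object can be saved to disk or sent over the network.

Deserialization is the reverse process: converting a byte stream back into a structured object. An object has to be serialized into a byte stream whenever it is passed between processes or persisted, and a byte stream that is received over the network or read from disk has to be deserialized to get the object back.

Java serialization (Serializable) is a heavyweight framework: a serialized object carries a lot of extra information (checksums, headers, the inheritance hierarchy, and so on), which makes it inefficient to send over the network. Hadoop therefore developed its own compact and efficient serialization mechanism, Writable. It does not have to transmit the multi-level parent-child relationships of a Java object; it transmits only the field values that are needed, which greatly reduces network overhead.

Writable is Hadoop's serialization format, defined by the Writable interface. A class only has to implement this interface to be serializable.

Writable also has a sub-interface, WritableComparable, which supports both serialization and comparison of keys. Here we implement WritableComparable in a custom key to get the sorting we want.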
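
The size difference is easy to observe with the JDK alone. The sketch below is not Hadoop code: WordCountRecord is a hypothetical class whose write and read methods mimic the field-by-field style of Writable using java.io.DataOutputStream and DataInputStream, and it is compared with standard Java serialization of the same object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical record used only for this comparison; not part of Hadoop.
class WordCountRecord implements Serializable {
    final String word;
    final int count;

    WordCountRecord(String word, int count) {
        this.word = word;
        this.count = count;
    }

    // Writable-style: write only the field values, in a fixed order.
    void write(DataOutputStream out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    // Field-only encoding of the record.
    static byte[] compactBytes(WordCountRecord r) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            r.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Writable-style deserialization: read the fields back in the same order.
    static WordCountRecord fromCompactBytes(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return new WordCountRecord(in.readUTF(), in.readInt());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Standard Java serialization of the same object, for comparison.
    static byte[] javaSerializedBytes(WordCountRecord r) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        WordCountRecord r = new WordCountRecord("hello", 2);
        byte[] compact = compactBytes(r);
        System.out.println("field-only bytes: " + compact.length);
        System.out.println("Java serialization bytes: " + javaSerializedBytes(r).length);
        WordCountRecord back = fromCompactBytes(compact);
        System.out.println(back.word + " " + back.count);
    }
}
```

For ("hello", 2) the field-only form takes 11 bytes: a 2-byte length prefix, 5 bytes for the word, and 4 bytes for the int. The Java-serialized form is several times larger because it also stores the class description.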

Step 1. Custom type and comparator

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class PairWritable implements WritableComparable<PairWritable> {
	// Composite key: the first part is the first column, the second part is the second column
	private String first;
	private int second;

	public PairWritable() {
	}

	public PairWritable(String first, int second) {
		this.set(first, second);
	}

	/**
	 * Convenience method for setting both fields.
	 */
	public void set(String first, int second) {
		this.first = first;
		this.second = second;
	}

	/**
	 * Deserialization
	 */
	@Override
	public void readFields(DataInput input) throws IOException {
		this.first = input.readUTF();
		this.second = input.readInt();
	}

	/**
	 * Serialization
	 */
	@Override
	public void write(DataOutput output) throws IOException {
		output.writeUTF(first);
		output.writeInt(second);
	}

	/*
	 * Implements the comparison that defines the sort order.
	 */
	@Override
	public int compareTo(PairWritable o) {
		// Each call compares the object the method is called on with the argument. Put simply,
		// the result of comparing line 1 with line 2 is compared with line 3, then with line 4, and so on.
		// Debug output: shows which two keys are being compared
		System.out.println(o.toString());
		System.out.println(this.toString());
		int comp = this.first.compareTo(o.first);
		if (comp != 0) {
			return comp;
		} else { // If the first fields are equal, compare the second fields
			return Integer.compare(this.second, o.getSecond());
		}
	}

	public int getSecond() {
		return second;
	}

	public void setSecond(int second) {
		this.second = second;
	}

	public String getFirst() {
		return first;
	}

	public void setFirst(String first) {
		this.first = first;
	}

	@Override
	public String toString() {
		return "PairWritable{" +
				"first='" + first + '\'' +
				", second=" + second +
				'}';
	}
}
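
The ordering that compareTo defines can be checked on its own, without running a job. The sketch below uses a hypothetical plain-Java class, PairKey, that copies only the comparison rule of PairWritable (first field, then second field) and sorts a few sample keys the way the shuffle would sort map output keys. The sample data is made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical plain-Java stand-in for PairWritable, without the Hadoop interfaces.
class PairKey implements Comparable<PairKey> {
    final String first;
    final int second;

    PairKey(String first, int second) {
        this.first = first;
        this.second = second;
    }

    // Same rule as PairWritable.compareTo: first field, then second field.
    @Override
    public int compareTo(PairKey o) {
        int comp = this.first.compareTo(o.first);
        return comp != 0 ? comp : Integer.compare(this.second, o.second);
    }

    @Override
    public String toString() {
        return "(" + first + ", " + second + ")";
    }

    // Returns a sorted copy, leaving the input list untouched.
    static List<PairKey> sorted(List<PairKey> input) {
        List<PairKey> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<PairKey> input = List.of(
                new PairKey("a", 5), new PairKey("b", 2),
                new PairKey("a", 1), new PairKey("b", 9));
        // prints [(a, 1), (a, 5), (b, 2), (b, 9)]
        System.out.println(sorted(input));
    }
}
```

Because the second column is part of the key, rows with the same first column come out ordered by the second column as well, which is the point of the composite key.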

Step 2. Mapper

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMapper extends Mapper<LongWritable, Text, PairWritable, IntWritable> {
	private PairWritable mapOutKey = new PairWritable();
	private IntWritable mapOutValue = new IntWritable();

	@Override
	public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		String lineValue = value.toString();
		// Each input line has two tab-separated columns
		String[] strs = lineValue.split("\t");
		// Build the composite key and the value ==> <(key, value), value>
		mapOutKey.set(strs[0], Integer.parseInt(strs[1]));
		mapOutValue.set(Integer.parseInt(strs[1]));
		context.write(mapOutKey, mapOutValue);
	}
}

Step 3. Reducer

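import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SortReducer extends Reducer<PairWritable, IntWritable, Text, IntWritable> {
	private Text outPutKey = new Text();

	@Override
	public void reduce(PairWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
		// Write out every value with the first part of the composite key
		for (IntWritable value : values) {
			outPutKey.set(key.getFirst());
			context.write(outPutKey, value);
		}
	}
}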

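Step 4. Main entry point

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SecondarySort extends Configured implements Tool {
	@Override
	public int run(String[] args) throws Exception {
		Configuration conf = super.getConf();
		// Run with the local job runner instead of submitting to YARN
		conf.set("mapreduce.framework.name", "local");
		Job job = Job.getInstance(conf, SecondarySort.class.getSimpleName());
		job.setJarByClass(SecondarySort.class);
		job.setInputFormatClass(TextInputFormat.class);
		// Input and output directories on the local file system
		TextInputFormat.addInputPath(job, new Path("file:///L:\\大数据离线阶段备课教案以及资料文档——by老王\\4、大数据离线第四天\\排序\\input"));
		TextOutputFormat.setOutputPath(job, new Path("file:///L:\\大数据离线阶段备课教案以及资料文档——by老王\\4、大数据离线第四天\\排序\\output"));
		job.setMapperClass(SortMapper.class);
		job.setMapOutputKeyClass(PairWritable.class);
		job.setMapOutputValueClass(IntWritable.class);
		job.setReducerClass(SortReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		boolean b = job.waitForCompletion(true);
		return b ? 0 : 1;
	}

	public static void main(String[] args) throws Exception {
		Configuration entries = new Configuration();
		// Exit with the job status: 0 on success, 1 on failure
		System.exit(ToolRunner.run(entries, new SecondarySort(), args));
	}
}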