MapReduce

Author: 做一个勤劳的码农

Category: Big Data. Tags: Big Data, MapReduce, Yarn

Article link: https://blog.csdn.net/baidu_41766416/article/details/86025328

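YARN architecture

Master/slave structure: ResourceManager (master) and NodeManager (one per worker node)

How a MapReduce job is scheduled

1. The client asks to run a job: it connects to the ResourceManager, which creates a job ID (JobClient.java).
2. YARN's ResourceManager accepts the request and prepares to allocate resources and tasks.
3. The client saves the job files to HDFS (JobClient.java), obtains the metadata (for both the data and the job), and submits the job.
4. The ResourceManager initializes the job (who will run it, and with how many resources).
5. The ResourceManager assigns the tasks and the resources.
6. Each NodeManager receives the metadata, accesses HDFS to fetch the data and the job files, and runs its tasks.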

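Resource allocation (three schedulers)

FIFO Scheduler: first come, first served. Drawback: it ignores job priority.
Capacity Scheduler: capacity-based scheduling. Cluster resources are divided among queues, and each queue is guaranteed a share of the capacity.
Fair Scheduler: fair scheduling. (Note: when installing and configuring Hive on Spark, YARN has to be configured to use the Fair Scheduler.) It starts from the assumption that every job has the same priority and divides the system's resources evenly. When priorities differ, each job gets a weight and resources are allocated in proportion to the weights.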
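
To make the weight rule concrete, here is a minimal sketch of a weighted split. It is a toy model, not the YARN Fair Scheduler itself: the class name FairShareSketch, the method shares, and the job names and numbers are all made up for illustration, and the only resource considered is memory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of weighted fair sharing (hypothetical names, memory only). */
final class FairShareSketch {

    /** Splits totalMemoryMb among the jobs in proportion to their weights. */
    static Map<String, Long> shares(long totalMemoryMb, Map<String, Double> weights) {
        double weightSum = 0;
        for (double w : weights.values()) {
            weightSum += w;
        }
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            result.put(e.getKey(), Math.round(totalMemoryMb * e.getValue() / weightSum));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("jobA", 1.0); // equal weights behave like equal priority
        weights.put("jobB", 1.0);
        weights.put("jobC", 2.0); // twice the weight, twice the share
        System.out.println(shares(8192, weights));
    }
}
```

With equal weights every job receives the same amount, which is the "same priority" case described above.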

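MapReduce programming basics

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
	@Override
	protected void map(LongWritable key1, Text value1, Context context)
			throws IOException, InterruptedException {
		// Input line, e.g.: I love Beijing
		String data = value1.toString();
		// Split the line into words
		String[] words = data.split(" ");
		// Emit <k2, v2> = <word, 1>
		for(String w:words){
			context.write(new Text(w), new IntWritable(1));
		}
	}
}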
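import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
	@Override
	protected void reduce(Text k3, Iterable<IntWritable> v3,Context context) throws IOException, InterruptedException {
		// Sum the values in v3
		int total = 0;
		for(IntWritable v:v3){
			total += v.get();
		}
		// Emit <k4, v4> = <word, frequency>
		context.write(k3, new IntWritable(total));
	}

}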
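import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMain {
	public static void main(String[] args) throws Exception {
		// Create the job and set its entry point
		Job job = Job.getInstance(new Configuration());
		job.setJarByClass(WordCountMain.class);  // the class that contains main
		// Set the mapper and its output types <k2, v2>
		job.setMapperClass(WordCountMapper.class);
		job.setMapOutputKeyClass(Text.class);    // type of k2
		job.setMapOutputValueClass(IntWritable.class);  // type of v2
		// Set the reducer and its output types <k4, v4>
		job.setReducerClass(WordCountReducer.class);
		job.setOutputKeyClass(Text.class);  // type of k4
		job.setOutputValueClass(IntWritable.class);  // type of v4
		// Set the input and output paths
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		// Run the job and report success or failure through the exit code
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

}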
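
The k1..k4 naming in the comments above is easier to follow when the whole flow is run on a few lines in memory. The following is only a sketch in plain Java, with no Hadoop involved: WordCountFlowSketch and run are hypothetical names, and the "shuffle" here is just a sorted map that groups values by key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of map -> shuffle -> reduce for word count. */
final class WordCountFlowSketch {

    static Map<String, Integer> run(List<String> lines) {
        // Map: every line becomes a list of <k2, v2> = <word, 1> pairs.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String w : line.split(" ")) {
                mapped.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        // Shuffle: group the values by key, keys in sorted order,
        // giving <k3, v3> = <word, [1, 1, ...]>.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        // Reduce: sum each group, giving <k4, v4> = <word, frequency>.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int total = 0;
            for (int v : e.getValue()) {
                total += v;
            }
            counts.put(e.getKey(), total);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("I love Beijing", "I love China")));
    }
}
```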

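Map task parallelism and how it is determined

The map-stage parallelism of a job is determined on the client side, when the job is submitted.
Each split is assigned one map task, and the map tasks run in parallel.
By default, split size = block size.
Splitting is done per file: every file is split on its own.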
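
A small calculation shows how these rules turn file sizes into a number of map tasks. This sketch assumes the rule I understand FileInputFormat to use: split size = max(minSize, min(maxSize, blockSize)), and a last piece up to 10% larger than the split size is not cut again. The class and method names are hypothetical, and empty files are left out.

```java
/** Estimates how many splits (and so map tasks) a non-empty file produces. */
final class SplitCountSketch {

    // Assumed slack factor: a remainder up to 1.1 x splitSize stays in one split.
    static final double SPLIT_SLOP = 1.1;

    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the last (possibly slightly oversized) piece
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long size = splitSize(128 * mb, 1, Long.MAX_VALUE); // defaults: split size = block size
        // Files are split one by one, so the totals simply add up.
        int total = countSplits(300 * mb, size) + countSplits(130 * mb, size);
        System.out.println("map tasks: " + total);
    }
}
```

Because each file is split separately, a directory of many tiny files yields one map task per file, which is why small files are treated as a problem later on.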

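MapReduce features

Serialization: the Writable interface

Once a class implements this interface, its objects can be used as keys and values (the input and output of Map and Reduce). A class used as a key must additionally be comparable, that is, implement WritableComparable.

Note: fields must be deserialized in exactly the same order in which they were serialized.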

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class Employee implements Writable{
	private int empno;
	private String ename;
	private String job;
	private int mgr;
	private String hiredate;
	private int sal;
	private int comm;
	private int deptno;
	@Override
	public void readFields(DataInput input) throws IOException {
		// Deserialization: read the fields in the order write() wrote them
		this.empno = input.readInt();
		this.ename = input.readUTF();
		this.job = input.readUTF();
		this.mgr = input.readInt();
		this.hiredate = input.readUTF();
		this.sal = input.readInt();
		this.comm = input.readInt();
		this.deptno = input.readInt();
	}
	@Override
	public void write(DataOutput output) throws IOException {
		// Serialization
		output.writeInt(this.empno);
		output.writeUTF(this.ename);
		output.writeUTF(this.job);
		output.writeInt(this.mgr);
		output.writeUTF(this.hiredate);
		output.writeInt(this.sal);
		output.writeInt(this.comm);
		output.writeInt(this.deptno);
	}
	// getters and setters omitted
}
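
Why the order matters can be seen with the JDK's own DataOutputStream and DataInputStream, which offer the same writeInt/writeUTF and readInt/readUTF calls as the DataOutput and DataInput used above. The sketch below uses a cut-down record with three fields; FieldOrderSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Shows that a field read in the wrong position comes back as garbage. */
final class FieldOrderSketch {

    /** Serializes empno, ename, sal in that order. */
    static byte[] write(int empno, String ename, int sal) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(empno);
            out.writeUTF(ename);
            out.writeInt(sal);
        }
        return bytes.toByteArray();
    }

    /** Reads in the same order as write(): empno, ename, sal. */
    static int readSalSameOrder(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readInt();        // empno
            in.readUTF();        // ename
            return in.readInt(); // sal
        }
    }

    /** Reads sal right after empno, skipping ename: the wrong order. */
    static int readSalWrongOrder(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readInt();        // empno
            return in.readInt(); // actually the length prefix and first bytes of ename
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = write(7369, "SMITH", 800);
        System.out.println("same order:  " + readSalSameOrder(data));
        System.out.println("wrong order: " + readSalWrongOrder(data));
    }
}
```

No exception is thrown in the wrong-order case here; the value is silently wrong, which makes this kind of mistake hard to spot.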

Sorting

By default MapReduce sorts by key; for Text keys this is dictionary order.

Sorting objects: implement the WritableComparable interface.

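import org.apache.hadoop.io.WritableComparable;

public class Employee implements WritableComparable<Employee>{
	private int empno;
	private String ename;
	private String job;
	private int mgr;
	private String hiredate;
	private int sal;
	private int comm;
	private int deptno;
	
	@Override
	public String toString() {
		return "Employee [empno=" + empno + ", ename=" + ename + ", sal=" + sal + ", deptno=" + deptno + "]";
	}
	
	@Override
	public int compareTo(Employee o) {
		// Multi-column sort: select * from emp order by deptno,sal;
		// First, order by deptno
		int byDeptno = Integer.compare(this.deptno, o.getDeptno());
		if(byDeptno != 0){
			return byDeptno;
		}
		// Same deptno: order by sal
		int bySal = Integer.compare(this.sal, o.getSal());
		if(bySal != 0){
			return bySal;
		}
		// Same deptno and sal: break the tie by empno, so that two different
		// employees never compare as equal while x.compareTo(x) is still 0
		return Integer.compare(this.empno, o.getEmpno());
	}
	// write, readFields, getters and setters omitted
}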

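Partitioning

By default a MapReduce job has only one partition.

To define your own partitions, extend Partitioner and override getPartition; the partitions are built from the map output <k2, v2>.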

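import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyEmployeeParitioner extends Partitioner<IntWritable, Employee>{
	/**
	 * numPartition: the number of partitions
	 */
	@Override
	public int getPartition(IntWritable k2, Employee v2, int numPartition) {
		// Decide which partition each record goes to
		if(v2.getDeptno() == 10){
			// Department 10 goes to partition 1
			return 1%numPartition;
		}else if(v2.getDeptno() == 20){
			// Department 20 goes to partition 2
			return 2%numPartition;
		}else{
			// Everything else goes to partition 0 (3 % 3 == 0 with three partitions)
			return 3%numPartition;
		}
	}
}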
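import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PartEmployeeMain {
	public static void main(String[] args) throws Exception {
		// Create the job
		Job job = Job.getInstance(new Configuration());
		job.setJarByClass(PartEmployeeMain.class);
		// Set the mapper and its output types <k2, v2>
		job.setMapperClass(PartEmployeeMapper.class);
		job.setMapOutputKeyClass(IntWritable.class);  // department number
		job.setMapOutputValueClass(Employee.class);   // employee
		// Set the partitioning rule
		job.setPartitionerClass(MyEmployeeParitioner.class);
		// Set the number of partitions (one reduce task per partition)
		job.setNumReduceTasks(3);
		// Set the reducer and its output types <k4, v4>
		job.setReducerClass(PartEmployeeReducer.class);
		job.setOutputKeyClass(IntWritable.class);  // department number
		job.setOutputValueClass(Employee.class);   // employee
		// Set the input and output paths
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		// Run the job
		job.waitForCompletion(true);
	}
}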
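
The partition rule can be tried out without a cluster. The sketch below repeats the department rule from the partitioner above and, for comparison, a hash rule written in the style of Hadoop's default HashPartitioner (hash code made non-negative, then modulo the number of partitions). PartitionSketch, byDept and byHash are hypothetical names.

```java
/** Partition-number arithmetic, detached from Hadoop for experimentation. */
final class PartitionSketch {

    /** Same rule as the custom partitioner: dept 10 -> 1, dept 20 -> 2, others -> 3 % n. */
    static int byDept(int deptno, int numPartitions) {
        if (deptno == 10) {
            return 1 % numPartitions;
        } else if (deptno == 20) {
            return 2 % numPartitions;
        }
        return 3 % numPartitions;
    }

    /** Hash rule: clear the sign bit, then take the remainder. */
    static int byHash(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        for (int deptno : new int[] {10, 20, 30}) {
            System.out.println("dept " + deptno + " -> partition " + byDept(deptno, 3));
        }
        System.out.println("hash of \"love\" -> partition " + byHash("love", 3));
    }
}
```

The modulo keeps every result inside the range 0 to numPartitions - 1, so the same partitioner still returns valid numbers if the job is run with fewer reduce tasks.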

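Combining (Combiner)

A Combiner runs a reduce step on the mapper side first, which cuts the amount of data sent from the mappers to the reducers.

Example: in an inverted index, counting how many times each word appears in each file.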
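
The saving can be illustrated by counting records. The sketch below imitates one map task in plain Java: without a combiner it would send one <word, 1> record per word, with a combiner it sends one <word, partial sum> record per distinct word. CombinerSketch and its methods are hypothetical names. In a real job the combiner is registered on the Job with setCombinerClass, and reusing the reducer for it is only safe when the operation, like this sum, gives the same result when applied in stages.

```java
import java.util.Map;
import java.util.TreeMap;

/** Counts the records one map task would send with and without a combiner. */
final class CombinerSketch {

    /** Without a combiner: one <word, 1> record per word occurrence. */
    static int recordsWithoutCombiner(String[] words) {
        return words.length;
    }

    /** Combiner step: pre-sum the counts locally, one record per distinct word. */
    static Map<String, Integer> combine(String[] words) {
        Map<String, Integer> partialSums = new TreeMap<>();
        for (String w : words) {
            partialSums.merge(w, 1, Integer::sum);
        }
        return partialSums;
    }

    public static void main(String[] args) {
        String[] words = "a b a c a b".split(" ");
        System.out.println("records without combiner: " + recordsWithoutCombiner(words));
        System.out.println("records with combiner:    " + combine(words).size());
        System.out.println("partial sums: " + combine(words));
    }
}
```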

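Custom input and output

You can override the Mapper's setup method; it runs before the map method is called.
For custom input, extend FileInputFormat and provide your own RecordReader. For custom output, extend FileOutputFormat and provide your own RecordWriter.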

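// 2. Write the RecordReader
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FuncRecordReader extends RecordReader<NullWritable, BytesWritable>{
	boolean isProcess = false;
	FileSplit split;
	Configuration conf;
	BytesWritable value = new BytesWritable();
	// Initialization
	@Override
	public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
		// Keep the file split
		this.split = (FileSplit) split;
		// Keep the job configuration
		conf = context.getConfiguration();
	}
	@Override
	public boolean nextKeyValue() throws IOException {
		if(!isProcess) {
			// 1. Allocate a buffer as long as the split (the whole file, since files are not split)
			byte[] buf = new byte[(int)split.getLength()];
			FSDataInputStream fis = null;
			try {
				// 2. Get the path
				Path path = split.getPath();
				// 3. Get the file system for that path
				FileSystem fs = path.getFileSystem(conf);
				// 4. Open an input stream
				fis = fs.open(path);
				// 5. Read the whole file into the buffer
				IOUtils.readFully(fis, buf, 0, buf.length);
				// 6. Copy the buffer into the output value
				value.set(buf, 0, buf.length);
			} finally {
				// Close only the stream; the FileSystem instance is cached and shared.
				// A read error is no longer swallowed, it fails the task.
				IOUtils.closeStream(fis);
			}
			isProcess = true;
			return true;
		}
		return false;
	}
	@Override
	public NullWritable getCurrentKey() throws IOException, InterruptedException {
		return NullWritable.get();
	}
	@Override
	public BytesWritable getCurrentValue() throws IOException, InterruptedException {
		return value;
	}
	@Override
	public float getProgress() throws IOException, InterruptedException {
		// The single record is either unread or fully read
		return isProcess ? 1 : 0;
	}
	@Override
	public void close() throws IOException {
	}
}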
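// 1. Create the custom InputFormat
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FuncFileInputFormat extends FileInputFormat<NullWritable, BytesWritable>{
	@Override
	protected boolean isSplitable(JobContext context,Path filename) {
		// Do not split the original file
		return false;
	}
	
	@Override
	public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context)
			throws IOException, InterruptedException {
		return new FuncRecordReader();
	}
}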
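import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SequenceFileMapper extends Mapper<NullWritable, BytesWritable, Text, BytesWritable>{
	Text k = new Text();
	@Override
	protected void setup(Context context)
			throws IOException, InterruptedException {
		// 1. Get the split information
		FileSplit split = (FileSplit)context.getInputSplit();
		// 2. Get the path
		Path path = split.getPath();
		// 3. Use the full path, including the file name, as the key
		k.set(path.toString());
	}
	@Override
	protected void map(NullWritable key, BytesWritable value,
			Context context)
			throws IOException, InterruptedException {
		// Emit <file path, file content>
		context.write(k, value);
	}	
}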
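import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceDriver {
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		// 1. Create the job
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		// 2. Locate the jar
		job.setJarByClass(SequenceDriver.class);
		// 3. Set the custom mapper and reducer classes
		job.setMapperClass(SequenceFileMapper.class);
		job.setReducerClass(SequenceFileReducer.class);
		// Use the custom input format
		job.setInputFormatClass(FuncFileInputFormat.class);
		// Write the output as a SequenceFile
		job.setOutputFormatClass(SequenceFileOutputFormat.class);
		// 4. Set the map output types
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(BytesWritable.class);
		// 5. Set the reduce output types (the final output types)
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(BytesWritable.class);
		// 6. Set the input path and the result path
		FileInputFormat.setInputPaths(job, new Path("c:/in1027/"));
		FileOutputFormat.setOutputPath(job, new Path("c:/out1027/"));
		// 7. Submit the job and report success or failure through the exit code
		boolean rs = job.waitForCompletion(true);
		System.exit(rs ? 0 : 1);
	}
}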

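import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class FileRecordWriter extends RecordWriter<Text, NullWritable>{
	Configuration conf = null;
	FSDataOutputStream itstarlog = null;
	FSDataOutputStream otherlog = null;
	// 1. Define where the data is written
	public FileRecordWriter(TaskAttemptContext job) throws IOException {
		// Get the configuration
		conf = job.getConfiguration();
		// Get the file system
		FileSystem fs = FileSystem.get(conf);
		// Define the output paths (these replace the usual part-r-00000 file)
		itstarlog = fs.create(new Path("c:/outitstaredu/itstar.logs"));
		otherlog = fs.create(new Path("c:/outputother/other.logs"));
	}
	// 2. Write the data
	@Override
	public void write(Text key, NullWritable value) throws IOException, InterruptedException {
		// Route each record by its key.
		// Text.getBytes() returns the backing array, of which only the
		// first getLength() bytes are valid, so write exactly that many.
		if(key.toString().contains("itstar")) {
			itstarlog.write(key.getBytes(), 0, key.getLength());
		}else {
			otherlog.write(key.getBytes(), 0, key.getLength());
		}
	}
	// 3. Release the resources
	@Override
	public void close(TaskAttemptContext context) throws IOException, InterruptedException {
		if(null != itstarlog) {
			itstarlog.close();
		}
		if(null != otherlog) {
			otherlog.close();
		}
	}
}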
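import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FuncFileOutputFormat extends FileOutputFormat<Text, NullWritable>{

	@Override
	public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job)
			throws IOException, InterruptedException {
		return new FileRecordWriter(job);
	}
}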

		// Driver fragment (pseudocode; the rest of the driver is omitted)
		// 5. Set the reduce output types (the final output types)
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);
		// Use the custom OutputFormat
		job.setOutputFormatClass(FuncFileOutputFormat.class);
		// 6. Set the input path and the result path
		FileInputFormat.setInputPaths(job, new Path("c:/in1029/"));
		FileOutputFormat.setOutputPath(job, new Path("c:/out1029/"));
		// 7. Submit the job
		boolean rs = job.waitForCompletion(true);

The job can also be told which InputFormat to use:

		// Set the InputFormat; the default is TextInputFormat.
		// CombineTextInputFormat is the small-file optimization.
		job.setInputFormatClass(CombineTextInputFormat.class);
		CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // at most 4 MB
		CombineTextInputFormat.setMinInputSplitSize(job, 3145728); // at least 3 MB

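Data compression

Compression effectively reduces the number of bytes read from and written to the underlying storage (HDFS).

Compression makes better use of network bandwidth and disk space, so it saves resources; it is an optimization strategy.

Compressing the data passed between mapper and reducer with a compression codec reduces disk I/O.

Basic rules for compression:

Compute-intensive jobs (complex logic, heavy computation): use compression sparingly, because compressing and decompressing costs CPU time.
I/O-intensive jobs: use compression heavily.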

Supported compression formats:

	Format  | Bundled with Hadoop | Extension | Splittable | Codec        | Original size | Compressed size | Compression speed | Decompression speed
	DEFAULT | yes                 | .deflate  | no         | DefaultCodec |               |                 |                   |
	Gzip    | yes                 | .gz       | no         | GzipCodec    | 8.3 GB        | 1.8 GB          | 17.5 MB/s         | 58 MB/s
	bzip2   | yes                 | .bz2      | yes        | BZip2Codec   | 8.3 GB        | 1.1 GB          | 2.4 MB/s          | 9.5 MB/s
	LZO     | no                  | .lzo      | yes        | LzoCodec     | 8.3 GB        | 2.96 GB         | 49.3 MB/s         | 74.6 MB/s
	Snappy  | no                  | .snappy   | no         | SnappyCodec  |               |                 |                   |
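
The trade-off between size and CPU work can be felt with a few lines of plain Java. This sketch uses the JDK's java.util.zip gzip streams as a stand-in for Hadoop's codec classes, so it shows the idea rather than the Hadoop API; GzipSketch and the sample text are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Compresses and restores a byte array with gzip (JDK classes only). */
final class GzipSketch {

    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(raw);
        }
        return bytes.toByteArray();
    }

    static byte[] gunzip(byte[] packed) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                bytes.write(buf, 0, n);
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Repetitive text, like many log files, compresses well.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("I love Beijing\n");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.println("original bytes:   " + raw.length);
        System.out.println("compressed bytes: " + packed.length);
    }
}
```

Fewer bytes have to be written and transferred, but every byte passes through the compressor and decompressor, which is the CPU cost behind the rule about compute-intensive jobs.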

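The core of MapReduce: the shuffle

Hadoop optimization

1 Efficiency bottlenecks of MR programs
   Purpose: distributed offline computation
   Machine performance:
   CPU, memory, disk, network
   I/O problems to optimize:

Data skew (fixed by optimizing the code)
Number of map and reduce tasks not set sensibly
Map tasks that run too long, leaving the reducers waiting
Too many small files; merge them (CombineTextInputFormat)
Very large files that cannot be split (constant spilling)
Many small spill files that need several merge passes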

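2 MR optimization methods
Six areas to consider: data input, map stage, reduce stage, I/O transfer, data skew, parameter tuning
Data input:

Merge small files before the MR job runs.
Use CombineTextInputFormat as the input format to cope with a large number of small input files; MR is not suited to processing many small files.

Map stage:

Reduce the number of spills: tune the sort buffer (100 MB by default, mapreduce.task.io.sort.mb) and its spill threshold (80% by default, mapreduce.map.sort.spill.percent).
Reduce the number of merge passes: increase mapreduce.task.io.sort.factor.
Use a combiner.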
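
A rough estimate shows why a larger buffer means fewer spills. The sketch below assumes the simplest possible model: a spill happens each time the buffer fills up to its threshold, and the bookkeeping data that shares the buffer is ignored, so real jobs will spill somewhat more often. SpillEstimateSketch and the sizes used are made up for illustration.

```java
/** Back-of-the-envelope spill count for one map task (simplified model). */
final class SpillEstimateSketch {

    /**
     * @param mapOutputMb  total output of the map task, in MB
     * @param sortBufferMb size of the sort buffer, in MB (100 by default)
     * @param spillPercent fill level that triggers a spill (0.8 by default)
     */
    static long spills(long mapOutputMb, long sortBufferMb, double spillPercent) {
        double thresholdMb = sortBufferMb * spillPercent;
        return (long) Math.ceil(mapOutputMb / thresholdMb);
    }

    public static void main(String[] args) {
        System.out.println("100 MB buffer: " + spills(410, 100, 0.8) + " spills");
        System.out.println("200 MB buffer: " + spills(410, 200, 0.8) + " spills");
    }
}
```

Fewer spill files also mean fewer files to merge at the end of the map task, which is where the sort factor comes in.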

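Reduce stage:

Set the numbers of map and reduce tasks sensibly.
Let map and reduce overlap: start the reducers once a certain share of the map tasks has finished, to cut the reducers' waiting time (mapreduce.job.reduce.slowstart.completedmaps).
Size the reduce-side buffer sensibly.

Transfer:

Compress the data.
Use SequenceFile.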

Data skew:

Use range partitioning.
Use a custom partitioner.
Use a combiner.
Whenever a map join is possible, never use a reduce join.
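
The reason a map join sidesteps skew is that nothing is shuffled: the small table sits in memory on every mapper and each input record is joined on the spot, so no single reducer ends up with all the records of a hot key. The sketch below shows the idea in plain Java. MapJoinSketch, the "ename,deptno" line format and the sample data are assumptions for illustration; in a real job the small table would typically be loaded in the Mapper's setup method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Map-side join: look up each record in a small in-memory table. */
final class MapJoinSketch {

    /** Joins lines of the form "ename,deptno" against deptno -> dname. */
    static List<String> join(Map<Integer, String> smallDeptTable, List<String> empLines) {
        List<String> joined = new ArrayList<>();
        for (String line : empLines) {
            String[] fields = line.split(",");
            int deptno = Integer.parseInt(fields[1].trim());
            String dname = smallDeptTable.get(deptno);
            if (dname != null) { // inner join: drop employees without a department
                joined.add(fields[0] + "," + dname);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Integer, String> depts = new HashMap<>();
        depts.put(10, "ACCOUNTING");
        depts.put(20, "RESEARCH");
        System.out.println(join(depts, Arrays.asList("SMITH,20", "KING,10", "NOONE,99")));
    }
}
```

This only works while the small table fits in each mapper's memory; two large tables still need a reduce join.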

Parameters:

CPU: the number of CPU cores used by map and reduce tasks
Memory: increase the memory available to map and reduce tasks
