Introduction to MapReduce
MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input dataset into independent chunks, which the map tasks process in a completely parallel manner. The framework sorts the map outputs and feeds them into the reduce tasks. Typically both the input and the output of a job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing any that fail.
Official documentation
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Job submission (diagram)
Note: the ApplicationMaster belongs to the MapReduce framework, not to YARN; it runs as a program inside a NodeManager container.
How does a NodeManager know that a task has been assigned to it?
Through the 3-second heartbeat mechanism: the ResourceManager records its allocation decisions in the scheduling queue (the split plan itself is kept in the job.split file).
MapReduce (diagram)
Input splits
In the source, the split size is computed as: splitSize = max(minSplitSize (default 1), min(maxSplitSize (default Long.MAX_VALUE), blockSize)), so by default the split size equals the block size.
/**
 * Generate the list of files and make them into FileSplits.
 * @param job the job context
 * @throws IOException
 */
public List<InputSplit> getSplits(JobContext job) throws IOException {
  StopWatch sw = new StopWatch().start();
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  long maxSize = getMaxSplitSize(job);

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  for (FileStatus file: files) {
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                      blkLocations[blkIndex].getHosts(),
                      blkLocations[blkIndex].getCachedHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkIndex].getHosts(),
                     blkLocations[blkIndex].getCachedHosts()));
        }
      } else { // not splitable
        splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                    blkLocations[0].getCachedHosts()));
      }
    } else {
      //Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  // Save the number of input files for metrics/loadgen
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
  }
  return splits;
}

protected long computeSplitSize(long blockSize, long minSize,
                                long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}
The number of map tasks depends on the number of input files, the amount of data, and the split size.
A split is a logical concept. The split information includes the start offset, the split size, the blocks that hold the split's data, and the hosts where those blocks live. Each split corresponds to one map task, so adjusting the split size adjusts the number of map tasks, i.e. the parallelism of the map phase.
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)); // returns 1 by default
long maxSize = getMaxSplitSize(job); // returns Long.MAX_VALUE by default
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
return Math.max(minSize, Math.min(maxSize, blockSize)); // computes the split size
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {} // splits carry a 1.1x slack: a small tail is folded into the last split
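For intuition, here is a minimal standalone sketch (plain Java, not the Hadoop API; the 128 MB block size and the 300 MB file size are made-up numbers for illustration) that replays the split loop above:

// Standalone sketch of the split loop above, using assumed sizes for illustration.
public class SplitSizeDemo {
    private static final double SPLIT_SLOP = 1.1; // same slop factor as FileInputFormat

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;           // assumed 128 MB block size
        long minSize = 1L;                             // getFormatMinSplitSize() default
        long maxSize = Long.MAX_VALUE;                 // getMaxSplitSize() default
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize)); // = blockSize

        long length = 300L * 1024 * 1024;              // hypothetical 300 MB input file
        long bytesRemaining = length;
        int splits = 0;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            System.out.println("split of " + splitSize + " bytes");
            bytesRemaining -= splitSize;
            splits++;
        }
        if (bytesRemaining != 0) {                     // tail split, up to 1.1 * splitSize
            System.out.println("final split of " + bytesRemaining + " bytes");
            splits++;
        }
        System.out.println("total splits (map tasks): " + splits);
    }
}

With these defaults, a 300 MB file produces splits of 128 MB, 128 MB and 44 MB (three map tasks), while a 130 MB file stays a single split because 130/128 ≈ 1.02 is below the 1.1 slack.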
Split mechanism
(1) Rules for reading a split (a simplified simulation follows these rules)
① The first split starts reading at the first line (offset 0), reads to the end of the split, and then also reads the first line of the next split.
② A split that is neither the first nor the last discards its first line, reads to the end of the split, and then also reads the first line of the next split.
③ The last split discards its first line and reads to the end of the split.
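A standalone simulation of these rules (plain Java, not the actual LineRecordReader source; the sample text and the 7-byte split size are made up) shows that every line is processed exactly once even when a split boundary cuts a line in half:

import java.util.ArrayList;
import java.util.List;

// Simulates the split-boundary rules above; not Hadoop code.
public class SplitBoundaryDemo {

    // Reads one line starting at pos, returning the index just past its '\n' (or EOF).
    static int readLine(byte[] data, int pos, StringBuilder out) {
        while (pos < data.length && data[pos] != '\n') {
            if (out != null) out.append((char) data[pos]);
            pos++;
        }
        return pos < data.length ? pos + 1 : pos;    // step over the '\n' if present
    }

    public static void main(String[] args) {
        byte[] data = "aaaa\nbbbb\ncccc\ndddd\n".getBytes();
        int splitSize = 7;                           // artificially small split
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            int pos = start;
            if (start != 0) {
                pos = readLine(data, pos, null);     // rules 2 and 3: discard the first line
            }
            List<String> lines = new ArrayList<>();
            while (pos <= end && pos < data.length) { // rules 1 and 2: the last line may run past 'end'
                StringBuilder line = new StringBuilder();
                pos = readLine(data, pos, line);
                lines.add(line.toString());
            }
            System.out.println("split [" + start + "," + end + ") -> " + lines);
        }
    }
}

The output assigns "aaaa" and "bbbb" to the first split, "cccc" to the second and "dddd" to the third: no line is lost or double-counted.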
Setting the split size
FileInputFormat.setMinInputSplitSize(job,1000);
FileInputFormat.setMaxInputSplitSize(job,1000000);
Whether a file can be split at all must be checked: compressed files in particular need this check (isSplitable).
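For reference, this is roughly how TextInputFormat performs that check (paraphrased from the Hadoop source): an uncompressed file is always splittable, while a compressed one is splittable only if its codec supports splitting (e.g. bzip2); otherwise the whole file becomes a single split.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;

// Paraphrase of TextInputFormat.isSplitable
protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
            new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    if (codec == null) {
        return true;                                   // plain text: splittable
    }
    return codec instanceof SplittableCompressionCodec; // only splittable codecs allow splits
}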
How the map side reads data (source walkthrough)
In the src package: InputFormat → FileInputFormat → TextInputFormat; its createRecordReader() returns a LineRecordReader, whose nextKeyValue(), getCurrentKey() and getCurrentValue() methods stream the data line by line.
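A hedged sketch of how the framework drives that reader (simplified; split and taskContext stand for the InputSplit and TaskAttemptContext the framework supplies, and the real driver loop is Mapper.run(), reproduced at the end of this section):

// Simplified consumption loop; split and taskContext are provided by the framework.
RecordReader<LongWritable, Text> reader =
        new TextInputFormat().createRecordReader(split, taskContext);  // a LineRecordReader
reader.initialize(split, taskContext);
while (reader.nextKeyValue()) {                    // stream the split one line at a time
    LongWritable key = reader.getCurrentKey();     // byte offset of the line in the file
    Text value = reader.getCurrentValue();         // the line's content
    // the framework passes each (key, value) on to map(key, value, context)
}
reader.close();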
Shuffle (diagram)
The map task writes into a circular in-memory buffer that keeps a 20% reserve; when the buffer is 80% full its contents are spilled to disk by the spiller.
While this happens, three arrays manage the metadata of the (key, value) records.
Shuffle refers to the whole path from the map side's unordered output to the reduce side's input.
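Both numbers are configurable; a hedged example using the standard Hadoop property names (100 MB and 0.80 are the usual defaults):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 100);             // size of the circular buffer, in MB
conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);  // start spilling when the buffer is 80% full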
Writing an MR program:
- A user-written MapReduce program has three parts: the Mapper, the Reducer, and the Driver (the client that submits and runs the MR job).
- The Mapper's input is key/value pairs (the key and value types can be customized).
- The Mapper's output is key/value pairs (the key and value types can be customized).
- The Mapper's business logic goes in the map method.
- The map method is called once for every key/value pair handed to the map task.
- The Reducer's input is key/value pairs (the key and value types can be customized).
- The Reducer's output is key/value pairs (the key and value types can be customized).
- The Reducer's business logic goes in the reduce method.
- The reduce task calls the reduce method once for every group of key/value pairs sharing the same key.
- User-defined Mappers and Reducers must extend their respective parent classes.
- The whole program needs a Driver to submit it; what gets submitted is a Job object describing all the necessary information.
Example
map
package pvcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * When the framework invokes the map method we wrote, it passes the data in as parameters
 * (one key, one value).
 * Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 * KEYIN:   the type of the key the framework (map task) passes into the map method
 * VALUEIN: the type of the value the framework (map task) passes into the map method
 * By default, the key passed in by the framework is the starting byte offset of a line
 * read from the input (text) file, so KEYIN is a long.
 * The value passed in is the content of that line, so VALUEIN is a String.
 *
 * However, long and String are native Java types and serialize inefficiently,
 * so Hadoop provides replacements:
 * long   -> LongWritable
 * String -> Text
 * int    -> IntWritable
 *
 * After processing, the map method writes out a key/value pair:
 * KEYOUT:   the type of the key in the map method's output
 * VALUEOUT: the type of the value in the map method's output
 */
public class PvConutMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    /**
     * The map method is provided by the MR framework and is called once for every line read.
     * Each line is turned into a key/value pair, handed to map, and map emits a key/value pair.
     * @param key
     * @param value
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        /**
         * Since we process one line at a time, we cannot aggregate in the map phase,
         * but we can emit the data as
         * 216.244.66.231  1
         * 140.205.201.39  1
         * 140.205.201.39  1
         * 220.181.108.104 1
         *
         * Records with the same IP are then shipped to the same machine for the reduce step,
         * grouped together, and their values are summed and written out.
         * This is in fact how MR programs work.
         */
        // value is of type Text
        String line = value.toString();
        String[] fields = line.split(" ");
        String ip = fields[0];
        // after processing, map hands the result back to the framework, which performs the shuffle
        context.write(new Text(ip), new IntWritable(1));
    }
}
reduce
package pvcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 * KEYIN   must match the key type output by the Mapper class
 * VALUEIN must match the value type output by the Mapper class
 */
public class PvCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    /**
     * Once the reduce side has gathered a group of records with the same key,
     * it calls the reduce method once per group.
     * @param key
     * @param values
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // 140.205.201.39 : {1,1,1,1,1,1}
        // Since the group shares one key, we only need to iterate over the values and sum them.
        // Define a counter first.
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
        // The map and reduce phases are now written; next the program must be packaged and
        // handed to YARN for execution. Before that we still need a client program that
        // specifies the input, output, etc. and submits the job to YARN; YARN then
        // distributes the jar to the other machines.
    }
}
driver
package pvcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 * The YARN client program that submits the MR job to YARN for execution.
 * YARN distributes the jar to multiple NodeManagers; execution is ordered:
 * the map program runs first, then the reduce program.
 */
public class PvCountRunner {
    public static void main(String[] args) {
        /**
         * The program needs a number of settings at startup. This information is scattered,
         * so we bundle it into one object: the Job.
         */
        try {
            Configuration conf = new Configuration();
            // get the job and pass it the configuration
            Job job = Job.getInstance(conf, "pvCount");
            // the job object carries the settings
            // the program ships as a jar, so we must point at the jar's location
            // here the jar is placed under /root
            // package the program as pv.jar, upload it to the Linux machine, and run it with:
            // hadoop jar /root/pv.jar pvcount.PvCountRunner /data/pvcount /out/pvcount
            // hadoop jar /root/pv.jar pvcount.CountRunner /access.log /count
            job.setJar("/root/pv.jar");
            /**
             * A jar may contain several jobs, so we must say which Mapper class
             * and which Reducer class this job uses.
             */
            job.setMapperClass(PvConutMapper.class);
            job.setReducerClass(PvCountReduce.class);
            /**
             * The map output has to be serialized, so tell the framework the map output types.
             */
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            /**
             * The reduce output also needs its types specified.
             */
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            /**
             * Tell the framework which component reads the data; for plain text files
             * use TextInputFormat (mind the imports).
             */
            job.setInputFormatClass(TextInputFormat.class);
            /**
             * Tell that component where to read from.
             * TextInputFormat has a parent class, FileInputFormat,
             * and the parent class is used to set the input location.
             * The input path is a directory; if it contains subdirectories,
             * recursive traversal must be enabled, otherwise the job fails.
             */
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileInputFormat.setMinInputSplitSize(job, 1000);
            FileInputFormat.setMaxInputSplitSize(job, 1000000);
            // set the output component
            job.setOutputFormatClass(TextOutputFormat.class);
            // set the output path
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // FileOutputFormat.
            job.setNumReduceTasks(5);
            // run
            /**
             * With everything configured, submit the job to YARN.
             * waitForCompletion submits the jar to the RM,
             * which distributes it so the other machines can execute it.
             */
            // the boolean parameter controls whether progress is reported:
            // if true, the cluster's progress is printed on the client while the job runs
            boolean res = job.waitForCompletion(true);
            /**
             * The exit status code returned by the client can be used in shell scripts
             * to branch on success or failure.
             */
            System.exit(res ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Note: a class extending Mapper can override the setup method, which is called only once per task (useful, for example, to open a database connection),
and the cleanup method, which is called only once, after the map task finishes, and can be used to close resources.
Caution: when closing resources, guard against memory leaks (closing a long-lived object directly while forgetting to close the shorter-lived objects it references).
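A hypothetical example of that lifecycle (the class name, JDBC URL and credentials are placeholders, not part of the original example): setup() opens a connection once per map task, cleanup() closes it once after the last map() call.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper illustrating the setup/cleanup lifecycle described above.
public class DbLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Connection conn;   // long-lived resource shared by all map() calls of this task

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        try {
            // placeholder URL/credentials; in practice read them from the Configuration
            conn = DriverManager.getConnection("jdbc:mysql://db-host:3306/dim", "user", "pass");
        } catch (SQLException e) {
            throw new IOException("failed to open lookup connection", e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use conn to enrich the record, then context.write(...) ...
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        try {
            if (conn != null) {
                conn.close();   // release the resource exactly once, after the last map() call
            }
        } catch (SQLException e) {
            throw new IOException("failed to close lookup connection", e);
        }
    }
}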
The Mapper class (Hadoop source)
package org.apache.hadoop.mapreduce;
import java.io.IOException;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.task.MapContextImpl;
/**
* Maps input key/value pairs to a set of intermediate key/value pairs.
*
* <p>Maps are the individual tasks which transform input records into a
* intermediate records. The transformed intermediate records need not be of
* the same type as the input records. A given input pair may map to zero or
* many output pairs.</p>
*
* <p>The Hadoop Map-Reduce framework spawns one map task for each
* {@link InputSplit} generated by the {@link InputFormat} for the job.
* <code>Mapper</code> implementations can access the {@link Configuration} for
* the job via the {@link JobContext#getConfiguration()}.
*
* <p>The framework first calls
* {@link #setup(org.apache.hadoop.mapreduce.Mapper.Context)}, followed by
* {@link #map(Object, Object, org.apache.hadoop.mapreduce.Mapper.Context)}
* for each key/value pair in the <code>InputSplit</code>. Finally
* {@link #cleanup(org.apache.hadoop.mapreduce.Mapper.Context)} is called.</p>
*
* <p>All intermediate values associated with a given output key are
* subsequently grouped by the framework, and passed to a {@link Reducer} to
* determine the final output. Users can control the sorting and grouping by
* specifying two key {@link RawComparator} classes.</p>
*
* <p>The <code>Mapper</code> outputs are partitioned per
* <code>Reducer</code>. Users can control which keys (and hence records) go to
* which <code>Reducer</code> by implementing a custom {@link Partitioner}.
*
* <p>Users can optionally specify a <code>combiner</code>, via
* {@link Job#setCombinerClass(Class)}, to perform local aggregation of the
* intermediate outputs, which helps to cut down the amount of data transferred
* from the <code>Mapper</code> to the <code>Reducer</code>.
*
* <p>Applications can specify if and how the intermediate
* outputs are to be compressed and which {@link CompressionCodec}s are to be
* used via the <code>Configuration</code>.</p>
*
* <p>If the job has zero
* reduces then the output of the <code>Mapper</code> is directly written
* to the {@link OutputFormat} without sorting by keys.</p>
*
* <p>Example:</p>
* <p><blockquote><pre>
* public class TokenCounterMapper
* extends Mapper<Object, Text, Text, IntWritable>{
*
* private final static IntWritable one = new IntWritable(1);
* private Text word = new Text();
*
* public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
* StringTokenizer itr = new StringTokenizer(value.toString());
* while (itr.hasMoreTokens()) {
* word.set(itr.nextToken());
* context.write(word, one);
* }
* }
* }
* </pre></blockquote>
*
* <p>Applications may override the
* {@link #run(org.apache.hadoop.mapreduce.Mapper.Context)} method to exert
* greater control on map processing e.g. multi-threaded <code>Mapper</code>s
* etc.</p>
*
* @see InputFormat
* @see JobContext
* @see Partitioner
* @see Reducer
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  /**
   * The <code>Context</code> passed on to the {@link Mapper} implementations.
   */
  public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }

  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}
A data-cleansing case implemented with MR in Java mainly requires knowing what data the map side receives (the byte offset and one line of text) and what the reduce side receives (the grouped map output); once the shuffle process is understood it is not difficult.
If a custom JavaBean is passed between map and reduce, it needs to implement Writable (or WritableComparable if it has to be compared), as in the MovieBean below and the comparable sketch after it.
package local_program;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * Created by HP on 2019/4/14 21:57
 */
public class MovieBean implements Writable/*, WritableComparable */ {
    Integer mid;
    Integer score;
    String time;
    Integer uid;

    public MovieBean(Integer mid, Integer score, String time, Integer uid) {
        this.mid = mid;
        this.score = score;
        this.time = time;
        this.uid = uid;
    }

    public MovieBean() {
    }

    public Integer getMid() {
        return mid;
    }

    public void setMid(Integer mid) {
        this.mid = mid;
    }

    public Integer getScore() {
        return score;
    }

    public void setScore(Integer score) {
        this.score = score;
    }

    public String getTime() {
        return time;
    }

    public void setTime(String time) {
        this.time = time;
    }

    public Integer getUid() {
        return uid;
    }

    public void setUid(Integer uid) {
        this.uid = uid;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(this.mid);
        dataOutput.writeInt(this.score);
        dataOutput.writeUTF(this.time);
        dataOutput.writeInt(this.uid);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.mid = dataInput.readInt();
        this.score = dataInput.readInt();
        this.time = dataInput.readUTF();
        this.uid = dataInput.readInt();
    }

    @Override
    public String toString() {
        return "MovieBean{" +
                "mid=" + mid +
                ", score=" + score +
                ", time='" + time + '\'' +
                ", uid=" + uid +
                '}';
    }

    /*@Override
    public int compareTo(Object o) {
        if (o instanceof MovieBean) {
            MovieBean mb = (MovieBean) o;
            return mb.score > this.score ? -1 : 1;
        }
        return 0;
    }*/
}
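If the bean must be comparable (for example when it is used as a map-output key and sorted during shuffle), a minimal sketch of the variant hinted at by the commented-out compareTo above could look like this (ComparableMovieBean is a hypothetical name, sorting descending by score):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Sketch of the comparable variant: WritableComparable lets the framework sort keys of this type.
public class ComparableMovieBean implements WritableComparable<ComparableMovieBean> {
    private int mid;
    private int score;
    private String time;
    private int uid;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(mid);
        out.writeInt(score);
        out.writeUTF(time);
        out.writeInt(uid);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        mid = in.readInt();
        score = in.readInt();
        time = in.readUTF();
        uid = in.readInt();
    }

    @Override
    public int compareTo(ComparableMovieBean other) {
        return Integer.compare(other.score, this.score);  // higher score sorts first
    }
}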
The driver can also implement Tool and be launched through ToolRunner; a sketch follows.
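A hedged sketch of that style, reusing the mapper and reducer from the example above (PvCountTool is a hypothetical name): ToolRunner parses generic options such as -D key=value, -files and -libjars into the Configuration before run() is called.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver rewritten as a Tool.
public class PvCountTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "pvCount");
        job.setJarByClass(PvCountTool.class);           // locate the jar by class instead of a fixed path
        job.setMapperClass(PvConutMapper.class);
        job.setReducerClass(PvCountReduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner injects generic command-line options into the Configuration, then calls run()
        System.exit(ToolRunner.run(new Configuration(), new PvCountTool(), args));
    }
}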