一. Map source code: what is the input key?
1. The template design pattern in the source code
1.1 What is the template design pattern?
Reference:
The Java template design pattern
https://www.cnblogs.com/yefengyu/p/10520531.html
1.2 Map source code and the template design pattern
1. The Javadoc on the Mapper class also shows the call sequence of the class:
* <p>The framework first calls
* {@link #setup(org.apache.hadoop.mapreduce.Mapper.Context)}, followed by
* {@link #map(Object, Object, Context)}
* for each key/value pair in the <code>InputSplit</code>. Finally
* {@link #cleanup(Context)} is called.</p>
* @see InputFormat
* @see JobContext
* @see Partitioner
* @see Reducer
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
......
}
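The template method itself is Mapper#run(Context): it fixes the call order setup -> map (once per key/value pair) -> cleanup, while subclasses only override the hook methods. Abridged from the Hadoop 2.x source:

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  try {
    // the framework drives the iteration; subclasses only supply map()
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
  } finally {
    // cleanup runs even if map() throws
    cleanup(context);
  }
}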
2. The Javadoc on setup shows that each map task executes the setup method once; for example, with two InputSplits, setup executes twice.
/**
* Called once at the beginning of the task.
*/
protected void setup(Context context
) throws IOException, InterruptedException {
// NOTHING
}
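The "-----------setup----------" and "-----------cleanup----------" lines in the log below presumably come from overrides like the following in the author's Mapper subclass (a hypothetical reconstruction; the original overrides are not shown):

@Override
protected void setup(Context context) throws IOException, InterruptedException {
  System.out.println("-----------setup----------");
}

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
  System.out.println("-----------cleanup----------");
}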
22:26:26 INFO deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
22:26:26 INFO JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
22:26:27 WARN JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
22:26:27 WARN JobResourceUploader: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
22:26:28 INFO FileInputFormat: Total input paths to process : 2
22:26:28 INFO JobSubmitter: number of splits:2
22:26:28 INFO JobSubmitter: Submitting tokens for job: job_local2068300432_0001
22:26:28 INFO Job: The url to track the job: http://localhost:8080/
22:26:28 INFO Job: Running job: job_local2068300432_0001
22:26:28 INFO LocalJobRunner: OutputCommitter set in config null
22:26:28 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
22:26:28 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
22:26:28 INFO LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
22:26:28 INFO LocalJobRunner: Waiting for map tasks
22:26:28 INFO LocalJobRunner: Starting task: attempt_local2068300432_0001_m_000000_0
22:26:28 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
22:26:28 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
22:26:28 INFO ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
22:26:28 INFO Task: Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@502fb6c2
22:26:28 INFO MapTask: Processing split: file:/E:/ideaProgram/rzG9_My2/hdfsOperation/data/wc/ruozedata.txt:0+36
22:26:28 INFO MapTask: (EQUATOR) 0 kvi 26214396(104857584)
22:26:28 INFO MapTask: mapreduce.task.io.sort.mb: 100
22:26:28 INFO MapTask: soft limit at 83886080
22:26:28 INFO MapTask: bufstart = 0; bufvoid = 104857600
22:26:28 INFO MapTask: kvstart = 26214396; length = 6553600
22:26:28 INFO MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
-----------setup----------
-----------cleanup----------
22:26:28 INFO LocalJobRunner:
22:26:28 INFO MapTask: Starting flush of map output
22:26:28 INFO MapTask: Spilling map output
22:26:28 INFO MapTask: bufstart = 0; bufend = 67; bufvoid = 104857600
22:26:28 INFO MapTask: kvstart = 26214396(104857584); kvend = 26214368(104857472); length = 29/6553600
22:26:28 INFO MapTask: Finished spill 0
22:26:28 INFO Task: Task:attempt_local2068300432_0001_m_000000_0 is done. And is in the process of committing
22:26:28 INFO LocalJobRunner: map
22:26:28 INFO Task: Task 'attempt_local2068300432_0001_m_000000_0' done.
22:26:28 INFO LocalJobRunner: Finishing task: attempt_local2068300432_0001_m_000000_0
22:26:28 INFO LocalJobRunner: Starting task: attempt_local2068300432_0001_m_000001_0
22:26:28 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
22:26:28 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
22:26:28 INFO ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
22:26:28 INFO Task: Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@757aeebc
22:26:28 INFO MapTask: Processing split: file:/E:/ideaProgram/rzG9_My2/hdfsOperation/data/wc/ruozedata2.txt:0+15
22:26:28 INFO MapTask: (EQUATOR) 0 kvi 26214396(104857584)
22:26:28 INFO MapTask: mapreduce.task.io.sort.mb: 100
22:26:28 INFO MapTask: soft limit at 83886080
22:26:28 INFO MapTask: bufstart = 0; bufvoid = 104857600
22:26:28 INFO MapTask: kvstart = 26214396; length = 6553600
22:26:28 INFO MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
-----------setup----------
-----------cleanup----------
22:26:28 INFO LocalJobRunner: ......
In this run the input consists of two files, so there are two InputSplits; likewise, a single 228 MB input file with a 128 MB HDFS block size would produce two InputSplits (128 MB + 100 MB):
22:26:28 INFO FileInputFormat: Total input paths to process : 2
22:26:28 INFO JobSubmitter: number of splits:2
The log above shows that setup and cleanup each executed twice, once per map task (there are two InputSplits).
1.3 Reducer source code and design pattern
1. From the source we can see it is essentially the same as map: also the template design pattern, so it is not repeated here.
2. Notably, the Javadoc in the Reducer source outlines the general flow of reduce, which is divided into three phases (the detailed execution of each phase will be explained later):
Shuffle --> Sort --> SecondarySort
* <p><code>Reducer</code> has 3 primary phases:</p>
* <ol>
* <li>
*
* <h4 id="Shuffle">Shuffle</h4>
*
* <p>The <code>Reducer</code> copies the sorted output from each
* {@link Mapper} using HTTP across the network.</p>
* </li>
*
* <li>
* <h4 id="Sort">Sort</h4>
*
* <p>The framework merge sorts <code>Reducer</code> inputs by
* <code>key</code>s
* (since different <code>Mapper</code>s may have output the same key).</p>
*
* <p>The shuffle and sort phases occur simultaneously i.e. while outputs are
* being fetched they are merged.</p>
*
* <h5 id="SecondarySort">SecondarySort</h5>
*
* <p>To achieve a secondary sort on the values returned by the value
* iterator, the application should extend the key with the secondary
* key and define a grouping comparator. The keys will be sorted using the
* entire key, but will be grouped using the grouping comparator to decide
* which keys and values are sent in the same call to reduce.The grouping
* comparator is specified via
* {@link Job#setGroupingComparatorClass(Class)}. The sort order is
* controlled by
* {@link Job#setSortComparatorClass(Class)}.</p>
@Checkpointable
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
......
}
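Reducer#run(Context) has the same template-method shape as Mapper#run: setup -> reduce (once per distinct key) -> cleanup. Abridged from the Hadoop 2.x source (the real method additionally resets the value iterator's backup store):

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  try {
    // one reduce() call per distinct key, with all of that key's values
    while (context.nextKey()) {
      reduce(context.getCurrentKey(), context.getValues(), context);
    }
  } finally {
    cleanup(context);
  }
}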
2. What is the offset of the input key in the map method?
2.1 Source data of the map input file
LI,Zhang,Xiao
J,XING,Zhang
Li,Xiao
2.2 Debugging to see what the key value actually is
The key is the byte offset at which each line of the input starts. (The test file was evidently created on Windows, so each line ends with the two-byte \r\n; the offsets below reflect that.)

Data:    L  I  ,  Z  h  a  n  g  ,  X  i  a  o  \r  \n
Offset:  0  1  2  3  4  5  6  7  8  9  10 11 12 13  14
Data:    J  ,  X  I  N  G  ,  Z  h  a  n  g  \r  \n
Offset:  15 16 ...                        27  28
Data:    L  i  ,  X  i  a  o
Offset:  29 30 ...
This can be verified by debugging:
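For reference, here is a minimal word-count mapper that prints the offset key (a sketch; the class name and the comma delimiter are assumptions based on the sample data; with the default TextInputFormat the framework passes key = LongWritable byte offset of the line start and value = Text line content):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OffsetPrintMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // prints offset=0, offset=15, offset=29 for the three lines above
        System.out.println("offset=" + key.get() + " line=" + value);
        for (String w : value.toString().split(",")) {
            word.set(w);
            context.write(word, ONE);
        }
    }
}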
3. What are the key/value in reduce?
The reduce key is the key being aggregated. Given the data source above:
LI,Zhang,Xiao
J,XING,Zhang
Li,Xiao,Zhang
Debugging shows that the aggregation order is ascending dictionary (lexicographic) order:
J 1
LI 1
Li 1
XING 1
Xiao [1,1,1]
Zhang [1,1,1]
The reduce values are of type Iterable<IntWritable>; summing them with the iterator produces the output count.
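A typical word-count reducer doing exactly that summation (a sketch, assuming the Text/IntWritable types used by the mapper above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable v : values) {   // e.g. for key "Zhang": [1, 1, 1]
            count += v.get();
        }
        context.write(key, new IntWritable(count));   // e.g. Zhang 3
    }
}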
二. The job submission flow in the source code
1. Map submission flow diagram
job.waitForCompletion
 |--submit()
     |--connect()
     |   |--LocalJobRunner or YarnRunner: local mode or cluster mode
     |--submitJobInternal(Job.this, cluster)
         |--checkSpecs(job)                           check that the output path is configured and does not already exist
         |--submitJobDir                              holds the job-related configuration
         |--copyAndConfigureFiles(job, submitJobDir)  load the jars/files the user passed in on the command line, i.e. load the related configuration
         |--submitJobFile                             get the job's configuration path
         |--writeSplits(job, submitJobDir)            [crucial] determines the number of maps
         |   |--writeNewSplits(job, jobSubmitDir)
         |       |--getSplits(job)
         |           |--long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job))
         |           |--long maxSize = getMaxSplitSize(job)
         |           |--long blockSize = file.getBlockSize()
         |           |--long splitSize = computeSplitSize(blockSize, minSize, maxSize)
         |           |   |--Math.max(minSize, Math.min(maxSize, blockSize))
         |           |--bytesRemaining/splitSize > SPLIT_SLOP
         |               |--private static final double SPLIT_SLOP = 1.1;   // 10% slop
         |--writeConf(conf, submitJobFile)            write the job file to the submit directory
         |--submitClient.submitJob                    submit the job and run it; returns the job state: RUNNING SUCCEEDED FAILED PREP KILLED
Broadly, MapReduce is divided into the following stages:
inputSplits => partitioner => combiner => group => sort => reduce
2. Detailed analysis of the submission source code
2.1 job.waitForCompletion
After a series of settings in the main method of a MapReduce program, the last step is to submit the job; this is where the submission of the whole MapReduce job begins:
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
2.2 job.submit (Job.java)
Stepping into waitForCompletion, we find that this method is in the Job.java class and that it calls the core method submit():
/**
* Submit the job to the cluster and wait for it to finish.
* @param verbose print the progress to the user
* @return true if the job succeeded
* @throws IOException thrown if the communication with the
* <code>JobTracker</code> is lost
*/
public boolean waitForCompletion(boolean verbose
) throws IOException, InterruptedException,
ClassNotFoundException {
if (state == JobState.DEFINE) {
//job submission: call the core method
submit();
}
if (verbose) {
monitorAndPrintJob();
} else {
// get the completion poll interval from the client.
int completionPollIntervalMillis =
Job.getCompletionPollInterval(cluster.getConf());
while (!isComplete()) {
try {
Thread.sleep(completionPollIntervalMillis);
} catch (InterruptedException ie) {
}
}
}
return isSuccessful();
}
2.3 submitJobInternal (Job.java)
Again Ctrl+click into submit(); this method is also in Job.java, and the most important thing it does is call submitJobInternal:
/**
* Submit the job to the cluster and return immediately.
* @throws IOException
*/
public void submit()
throws IOException, InterruptedException, ClassNotFoundException {
ensureState(JobState.DEFINE);
setUseNewAPI();
connect();
final JobSubmitter submitter =
getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
public JobStatus run() throws IOException, InterruptedException,
ClassNotFoundException {
// implement the run method of the PrivilegedExceptionAction interface; step into submitJobInternal
return submitter.submitJobInternal(Job.this, cluster);
}
});
state = JobState.RUNNING;
LOG.info("The url to track the job: " + getTrackingURL());
}
2.4 submitJobInternal (JobSubmitter.java)
This method is in the JobSubmitter.java class. Its execution path is quite long; for completeness the full source is pasted below, but you only need to read the places marked with comments (a-g).
JobStatus submitJobInternal(Job job, Cluster cluster)
throws ClassNotFoundException, InterruptedException, IOException {
//validate the jobs output specs
//a. Check whether the output path is configured and whether it already exists; explained in detail in 2.4.1
checkSpecs(job);
Configuration conf = job.getConfiguration();
addMRFrameworkToDistributedCache(conf);
Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
//configure the command line options correctly on the submitting dfs
InetAddress ip = InetAddress.getLocalHost();
if (ip != null) {
submitHostAddress = ip.getHostAddress();
submitHostName = ip.getHostName();
conf.set(MRJobConfig.JOB_SUBMITHOST,submitHostName);
conf.set(MRJobConfig.JOB_SUBMITHOSTADDR,submitHostAddress);
}
JobID jobId = submitClient.getNewJobID();
job.setJobID(jobId);
//b. submitJobDir holds the job-related configuration; the code above fetches that configuration
Path submitJobDir = new Path(jobStagingArea, jobId.toString());
JobStatus status = null;
try {
conf.set(MRJobConfig.USER_NAME,
UserGroupInformation.getCurrentUser().getShortUserName());
conf.set("hadoop.http.filter.initializers",
"org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer");
conf.set(MRJobConfig.MAPREDUCE_JOB_DIR, submitJobDir.toString());
LOG.debug("Configuring job " + jobId + " with " + submitJobDir
+ " as the submit dir");
// get delegation token for the dir
TokenCache.obtainTokensForNamenodes(job.getCredentials(),
new Path[] { submitJobDir }, conf);
populateTokenCache(conf, job.getCredentials());
// generate a secret to authenticate shuffle transfers
if (TokenCache.getShuffleSecretKey(job.getCredentials()) == null) {
KeyGenerator keyGen;
try {
keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM);
keyGen.init(SHUFFLE_KEY_LENGTH);
} catch (NoSuchAlgorithmException e) {
throw new IOException("Error generating shuffle secret key", e);
}
SecretKey shuffleKey = keyGen.generateKey();
TokenCache.setShuffleSecretKey(shuffleKey.getEncoded(),
job.getCredentials());
}
if (CryptoUtils.isEncryptedSpillEnabled(conf)) {
conf.setInt(MRJobConfig.MR_AM_MAX_ATTEMPTS, 1);
LOG.warn("Max job attempts set to 1 since encrypted intermediate" +
"data spill is enabled");
}
//c. Here the jars/files the user passed in on the command line are loaded, i.e. the related configuration is loaded.
//Stepping into this method you can also see its comment: configure the jobconf of the user with the command line options of -libjars, -files, -archives.
copyAndConfigureFiles(job, submitJobDir);
//d. Get the job's configuration path. In short, steps a, b, c and d are preparation: loading the job's configuration, paths, and so on.
Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);
// Create the splits for the job
LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
//e. [Crucial] This method is the heart of the matter: it determines the number of map splits, explained in detail in 2.4.2
int maps = writeSplits(job, submitJobDir);
conf.setInt(MRJobConfig.NUM_MAPS, maps);
LOG.info("number of splits:" + maps);
// write "queue admins of the queue to which job is being submitted"
// to job file.
String queue = conf.get(MRJobConfig.QUEUE_NAME,
JobConf.DEFAULT_QUEUE_NAME);
AccessControlList acl = submitClient.getQueueAdmins(queue);
conf.set(toFullPropertyName(queue,
QueueACL.ADMINISTER_JOBS.getAclName()), acl.getAclString());
// removing jobtoken referrals before copying the jobconf to HDFS
// as the tasks don't need this setting, actually they may break
// because of it if present as the referral will point to a
// different job.
TokenCache.cleanUpTokenReferral(conf);
if (conf.getBoolean(
MRJobConfig.JOB_TOKEN_TRACKING_IDS_ENABLED,
MRJobConfig.DEFAULT_JOB_TOKEN_TRACKING_IDS_ENABLED)) {
// Add HDFS tracking ids
ArrayList<String> trackingIds = new ArrayList<String>();
for (Token<? extends TokenIdentifier> t :
job.getCredentials().getAllTokens()) {
trackingIds.add(t.decodeIdentifier().getTrackingId());
}
conf.setStrings(MRJobConfig.JOB_TOKEN_TRACKING_IDS,
trackingIds.toArray(new String[trackingIds.size()]));
}
// Set reservation info if it exists
ReservationId reservationId = job.getReservationId();
if (reservationId != null) {
conf.set(MRJobConfig.RESERVATION_ID, reservationId.toString());
}
// Write job file to submit dir
//f. Write the job file to the submit directory
writeConf(conf, submitJobFile);
//
// Now, actually submit the job (using the submit name)
//
printTokens(jobId, job.getCredentials());
//g. Submit the job and run it; returns the job state: RUNNING SUCCEEDED FAILED PREP KILLED
status = submitClient.submitJob(
jobId, submitJobDir.toString(), job.getCredentials());
if (status != null) {
return status;
} else {
throw new IOException("Could not launch job");
}
} finally {
if (status == null) {
LOG.info("Cleaning up the staging area " + submitJobDir);
if (jtFs != null && submitJobDir != null)
jtFs.delete(submitJobDir, true);
}
}
}
2.4.1 The checkSpecs(job) implementation in the JobSubmitter class
When reading an important method of a class, pay particular attention to the last method call before it returns, especially when the implementation is only a few lines long. Here, that call is checkOutputSpecs:
private void checkSpecs(Job job) throws ClassNotFoundException,
InterruptedException, IOException {
JobConf jConf = (JobConf)job.getConfiguration();
// Check the output specification
if (jConf.getNumReduceTasks() == 0 ?
jConf.getUseNewMapper() : jConf.getUseNewReducer()) {
org.apache.hadoop.mapreduce.OutputFormat<?, ?> output =
ReflectionUtils.newInstance(job.getOutputFormatClass(),
job.getConfiguration());
output.checkOutputSpecs(job);
} else {
//step into checkOutputSpecs
jConf.getOutputFormat().checkOutputSpecs(jtFs, jConf);
}
}
Stepping in, we find this is an abstract method of the OutputFormat interface, so we need to look at an implementing class of OutputFormat. There are many; we look at the most commonly used one, FileOutputFormat. It implements checkOutputSpecs, where you can see the two exceptions InvalidJobConfException and FileAlreadyExistsException (see the comments in the source below).
public void checkOutputSpecs(FileSystem ignored, JobConf job)
throws FileAlreadyExistsException,
InvalidJobConfException, IOException {
// Ensure that the output directory is set and not already there
Path outDir = getOutputPath(job);
if (outDir == null && job.getNumReduceTasks() != 0) {
//thrown if the output directory is not configured
throw new InvalidJobConfException("Output directory not set in JobConf.");
}
if (outDir != null) {
FileSystem fs = outDir.getFileSystem(job);
// normalize the output directory
outDir = fs.makeQualified(outDir);
setOutputPath(job, outDir);
// get delegation token for the outDir's file system
TokenCache.obtainTokensForNamenodes(job.getCredentials(),
new Path[] {outDir}, job);
// check its existence
if (fs.exists(outDir)) {
//thrown if the output directory already exists
throw new FileAlreadyExistsException("Output directory " + outDir +
" already exists");
}
}
}
2.4.2 [Crucial] maps = writeSplits(job, submitJobDir)
(1) First, remember the conclusion:
For now, the number of InputSplits can simply be understood as the number of blocks (with the default configuration unchanged):
block = InputSplit (split)
(2) Start with the implementation of writeSplits(job, submitJobDir) in the JobSubmitter class.
Focus on writeNewSplits:
private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
Path jobSubmitDir) throws IOException,
InterruptedException, ClassNotFoundException {
JobConf jConf = (JobConf)job.getConfiguration();
int maps;
if (jConf.getUseNewMapper()) {
//Hadoop 2.x and above use the new API, so follow this branch
maps = writeNewSplits(job, jobSubmitDir);
} else {
maps = writeOldSplits(jConf, jobSubmitDir);
}
return maps;
}
(3) Within writeNewSplits, focus on the getSplits method:
private <T extends InputSplit>
int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = job.getConfiguration();
//obtain an instance of InputFormat via reflection
InputFormat<?, ?> input =
ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
//input.getSplits returns the list of splits; focus on the implementation of getSplits
List<InputSplit> splits = input.getSplits(job);
T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);
// sort the splits into order based on size, so that the biggest
// go first
Arrays.sort(array, new SplitComparator());
JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
jobSubmitDir.getFileSystem(conf), array);
return array.length;
}
(4) Stepping into getSplits, we find it is again an abstract method of the InputFormat interface. To find its implementation we again look at the implementing classes of InputFormat, focusing on the commonly used FileInputFormat and its getSplits implementation. Pay attention to the commented lines:
/**
* Generate the list of files and make them into FileSplits.
* @param job the job context
* @throws IOException
*/
public List<InputSplit> getSplits(JobContext job) throws IOException {
StopWatch sw = new StopWatch().start();
//get the minimum split size
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
//get the maximum split size
long maxSize = getMaxSplitSize(job);
// generate splits
List<InputSplit> splits = new ArrayList<InputSplit>();
List<FileStatus> files = listStatus(job);
for (FileStatus file: files) {
Path path = file.getPath();
long length = file.getLen();
if (length != 0) {
BlockLocation[] blkLocations;
if (file instanceof LocatedFileStatus) {
blkLocations = ((LocatedFileStatus) file).getBlockLocations();
} else {
FileSystem fs = path.getFileSystem(job.getConfiguration());
blkLocations = fs.getFileBlockLocations(file, 0, length);
}
if (isSplitable(job, path)) {
//get the blockSize
long blockSize = file.getBlockSize();
//computeSplitSize yields the final split size
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
long bytesRemaining = length;
//when splitting the file, allow the last split to overflow the block size by 10%
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(makeSplit(path, length-bytesRemaining, splitSize,
blkLocations[blkIndex].getHosts(),
blkLocations[blkIndex].getCachedHosts()));
bytesRemaining -= splitSize;
}
if (bytesRemaining != 0) {
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
blkLocations[blkIndex].getHosts(),
blkLocations[blkIndex].getCachedHosts()));
}
} else { // not splitable
if (LOG.isDebugEnabled()) {
// Log only if the file is big enough to be splitted
if (length > Math.min(file.getBlockSize(), minSize)) {
LOG.debug("File is not splittable so no parallelization "
+ "is possible: " + file.getPath());
}
}
splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
blkLocations[0].getCachedHosts()));
}
} else {
//Create empty hosts array for zero length files
splits.add(makeSplit(path, 0, length, new String[0]));
}
}
// Save the number of input files for metrics/loadgen
job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
sw.stop();
if (LOG.isDebugEnabled()) {
LOG.debug("Total # of splits generated by getSplits: " + splits.size()
+ ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
}
return splits;
}
/*
 * getSplits calls this method to compute the final split size
 */
protected long computeSplitSize(long blockSize, long minSize,
long maxSize) {
return Math.max(minSize, Math.min(maxSize, blockSize));
}
The few lines of comments above do not analyze things thoroughly enough, so let's go through the key code line by line.
a. Get the minimum split size: minSize = 1
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
==>
//getFormatMinSplitSize() is simply 1
/**
* Get the lower bound on split size imposed by the format.
* @return the number of bytes of the minimal split for this format
*/
protected long getFormatMinSplitSize() {
return 1;
}
//getMinSplitSize defaults to 1 and can be changed via the mapreduce.input.fileinputformat.split.minsize setting.
/**
* Get the minimum split size
* @param job the job
* @return the minimum number of bytes that can be in a split
*/
public static long getMinSplitSize(JobContext job) {
return job.getConfiguration().getLong(SPLIT_MINSIZE, 1L);
}
public static final String SPLIT_MINSIZE =
"mapreduce.input.fileinputformat.split.minsize";
==>
long minSize = Math.max(1, 1) = 1
b. Get the maximum split size: maxSize = Long.MAX_VALUE = 9223372036854775807
long maxSize = getMaxSplitSize(job);
//From the source below we can see this amounts to
getLong("mapreduce.input.fileinputformat.split.maxsize", 0x7fffffffffffffffL)
which returns Long's maximum value, 9223372036854775807, effectively unbounded.
/**
* Get the maximum split size.
* @param context the job to look at.
* @return the maximum number of bytes a split can include
*/
public static long getMaxSplitSize(JobContext context) {
return context.getConfiguration().getLong(SPLIT_MAXSIZE,
Long.MAX_VALUE);
}
/**
* Get the value of the <code>name</code> property as a <code>long</code>.
* If no such property exists, the provided default value is returned,
* or if the specified value is not a valid <code>long</code>,
* then an error is thrown.
*
* @param name property name.
* @param defaultValue default value.
* @throws NumberFormatException when the value is invalid
* @return property value as a <code>long</code>,
* or <code>defaultValue</code>.
*/
public long getLong(String name, long defaultValue) {
String valueString = getTrimmed(name);
if (valueString == null)
return defaultValue;
String hexString = getHexDigits(valueString);
if (hexString != null) {
return Long.parseLong(hexString, 16);
}
return Long.parseLong(valueString);
}
public static final String SPLIT_MAXSIZE =
"mapreduce.input.fileinputformat.split.maxsize";
/**
* A constant holding the maximum value a {@code long} can
* have, 2<sup>63</sup>-1.
*/
@Native public static final long MAX_VALUE = 0x7fffffffffffffffL;
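If you do want to override these bounds in a job driver, FileInputFormat exposes setters for both properties. A sketch with example values only: raising minSize above blockSize yields fewer, larger splits; lowering maxSize below blockSize yields more, smaller splits.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    static void tuneSplitSize(Job job) {
        // equivalent to conf.setLong("mapreduce.input.fileinputformat.split.minsize", ...)
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB
        // equivalent to conf.setLong("mapreduce.input.fileinputformat.split.maxsize", ...)
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);  // 512 MB
    }
}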
c. Get the blockSize: 32 MB by default in local mode, 128 MB by default on a cluster
long blockSize = file.getBlockSize();
Scenario 1: when running in local mode from IDEA, debugging shows that blockSize is 32 MB.
//the parameter that sets blockSize in local mode; here overriding it to 128 MB
configuration.set("fs.local.block.size","134217728");
Methods in public abstract class FileSystem:
@Deprecated
public long getDefaultBlockSize() {
// default to 32MB: large enough to minimize the impact of seeks
return getConf().getLong("fs.local.block.size", 32 * 1024 * 1024);
}
/** Return the number of bytes that large input files should be optimally
* be split into to minimize i/o time. The given path will be used to
* locate the actual filesystem. The full path does not have to exist.
* @param f path of file
* @return the default block size for the path's filesystem
*/
public long getDefaultBlockSize(Path f) {
return getDefaultBlockSize();
}
Scenario 2: when running on a Hadoop cluster, the default configuration is 128 MB,
set in Hadoop's configuration file hdfs-site.xml:
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
<description>
The default block size for new files, in bytes.
You can use the following suffix (case insensitive):
k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.),
Or provide complete size in bytes (such as 134217728 for 128 MB).
</description>
</property>
d. computeSplitSize gives the split size: splitSize = blockSize = 128 MB
# Taking cluster mode as the example first, we get splitSize = blockSize = 128 MB:
splitSize = Math.max(minSize, Math.min(maxSize, blockSize))
=> Math.max(1, Math.min(9223372036854775807, 134217728))
=> Math.max(1, 134217728)
=> 134217728
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
protected long computeSplitSize(long blockSize, long minSize,
long maxSize) {
return Math.max(minSize, Math.min(maxSize, blockSize));
}
# Math.max(minValue, Math.min(maxValue, middleValue))
Let's look at what this formula really does.
Suppose we have the three numbers 5, 9 and 7: the minimum is 5, the maximum is 9, and the middle value is 7.
Plugging into the formula: Math.max(5, Math.min(9, 7)) = 7
The pattern: the formula always picks the middle (median) of the three values.
# So what is the general rule? To summarize:
When minSize < blockSize < maxSize, splitSize = blockSize
When blockSize < minSize < maxSize, splitSize = minSize
When minSize < maxSize < blockSize, splitSize = maxSize
Since minSize is 1 and maxSize is Long's maximum value, splitSize is simply blockSize in 99.9% of cases.
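A quick standalone sanity check of this median behavior (a sketch, not Hadoop code; only the formula is copied from FileInputFormat):

public class ComputeSplitSizeDemo {
    // same formula as FileInputFormat#computeSplitSize
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 134217728L;  // 128 MB
        // defaults: splitSize = blockSize
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));          // 134217728
        // minSize raised above blockSize: splitSize = minSize
        System.out.println(computeSplitSize(blockSize, 268435456L, Long.MAX_VALUE));  // 268435456
        // maxSize lowered below blockSize: splitSize = maxSize
        System.out.println(computeSplitSize(blockSize, 1L, 67108864L));               // 67108864
    }
}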
e. The last split may overflow blockSize by 10%
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(makeSplit(path, length-bytesRemaining, splitSize,
blkLocations[blkIndex].getHosts(),
blkLocations[blkIndex].getCachedHosts()));
bytesRemaining -= splitSize;
}
private static final double SPLIT_SLOP = 1.1; // 10% slop
bytesRemaining / 128 MB > 1.1
=> bytesRemaining > 140.8 MB
What is this overflow for? Take an example: naively splitting a 268 MB file would give three splits of 128 MB, 128 MB and 12 MB.
But because Hadoop allows a 10% overflow of the blockSize, there are actually only two splits: 128 MB and 140 MB.
Imagine a file whose last line is "a b c d e" and which reaches 128 MB right at "c": with the 10% overflow, "d e" can be included in the same split,
which helps keep a single line of the file inside one map task. The example is not rigorous, but it conveys the benefit of the 10% slop.
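To make the arithmetic concrete, here is a standalone sketch of the same split loop applied to a hypothetical 268 MB file:

public class SplitSlopDemo {
    private static final double SPLIT_SLOP = 1.1;  // 10% slop, as in FileInputFormat

    public static void main(String[] args) {
        long splitSize = 128L * 1024 * 1024;        // 128 MB
        long bytesRemaining = 268L * 1024 * 1024;   // 268 MB file
        int n = 0;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            System.out.println("split " + (++n) + ": " + splitSize + " bytes");       // 128 MB
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            System.out.println("split " + (++n) + ": " + bytesRemaining + " bytes");  // 140 MB
        }
        // output: two splits (128 MB and 140 MB), not three
    }
}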