Hadoop's InputFormat, its concrete subclasses, and the RecordReader they create are the key pieces of how a MapTask reads its input data.
一、Obtaining the splits and the number of mappers
1. JobClient's submitJobInternal generates the splits and determines the number of mappers
- public RunningJob submitJobInternal(...) { // parameters elided
- return ugi.doAs(new PrivilegedExceptionAction<RunningJob>() {
- ....
- int maps = writeSplits(context, submitJobDir); // generate the splits and determine the number of mappers
- ....
- }}
- private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
- Path jobSubmitDir) throws IOException,
- InterruptedException, ClassNotFoundException {
- JobConf jConf = (JobConf)job.getConfiguration();
- int maps;
- if (jConf.getUseNewMapper()) {
- maps = writeNewSplits(job, jobSubmitDir); // the new API takes this path
- } else {
- maps = writeOldSplits(jConf, jobSubmitDir);
- }
- return maps;
- }
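Which branch is taken is controlled by a flag in the job configuration rather than by anything the user calls directly. A minimal sketch of the check, assuming the Hadoop 1.x key mapred.mapper.new-api (the key name is an assumption here; Job flips it internally during submission when the job was written against the new org.apache.hadoop.mapreduce API):
- // Rough sketch of what JobConf.getUseNewMapper() evaluates; the key name
- // "mapred.mapper.new-api" is assumed from Hadoop 1.x, where Job sets it to
- // true on submit() when a new-API Mapper has been configured.
- boolean useNewApi = conf.getBoolean("mapred.mapper.new-api", false);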
- private <T extends InputSplit>
- int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
- InterruptedException, ClassNotFoundException {
- Configuration conf = job.getConfiguration();
- InputFormat<?, ?> input =
- ReflectionUtils.newInstance(job.getInputFormatClass(), conf); // instantiate the configured InputFormat via reflection
- List<InputSplit> splits = input.getSplits(job); // call the getSplits method implemented by the InputFormat subclass
- T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]); // copy the split list into an array (a rather roundabout way to do a simple conversion)
- // sort the splits into order based on size, so that the biggest
- // go first
- Arrays.sort(array, new SplitComparator()); // sort the splits by size, largest first
- JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
- jobSubmitDir.getFileSystem(conf), array);
- return array.length; // the number of splits is the number of mappers
- }
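The InputFormat class instantiated by reflection above is whatever the driver configured on the Job. A minimal driver sketch for reference (the class and path names are just placeholders; mapper, reducer and output settings are omitted):
- import org.apache.hadoop.conf.Configuration;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.mapreduce.Job;
- import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
- import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
- public class InputFormatDriver {
-   public static void main(String[] args) throws Exception {
-     Job job = new Job(new Configuration(), "input-format-demo"); // Job.getInstance(conf, ...) in newer releases
-     job.setInputFormatClass(TextInputFormat.class);       // the class writeNewSplits instantiates by reflection
-     FileInputFormat.addInputPath(job, new Path(args[0])); // input path supplied on the command line
-     // when the job is submitted, the split-generation code shown above
-     // runs against TextInputFormat.getSplits()
-   }
- }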
2. The getSplits method of the most commonly used FileInputFormat
- public List<InputSplit> getSplits(JobContext job
- ) throws IOException {
- long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
- long maxSize = getMaxSplitSize(job);
- // generate splits
- List<InputSplit> splits = new ArrayList<InputSplit>();
- List<FileStatus>files = listStatus(job);
- for (FileStatus file: files) {
- Path path = file.getPath();
- FileSystem fs = path.getFileSystem(job.getConfiguration());
- long length = file.getLen();
- BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
- if ((length != 0) && isSplitable(job, path)) {
- long blockSize = file.getBlockSize();
- long splitSize = computeSplitSize(blockSize, minSize, maxSize); // compute the split size from the block size and the configured min/max
- long bytesRemaining = length;
- while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) { // carve a large file into multiple splits
- int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
- splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
- blkLocations[blkIndex].getHosts()));
- bytesRemaining -= splitSize;
- }
- if (bytesRemaining != 0) {
- splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
- blkLocations[blkLocations.length-1].getHosts()));
- }
- } else if (length != 0) {
- splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
- } else {
- //Create empty hosts array for zero length files
- splits.add(new FileSplit(path, 0, length, new String[0]));
- }
- }
- // Save the number of input files in the job-conf
- job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
- LOG.debug("Total # of splits: " + splits.size());
- return splits;
- }
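To make the split-size arithmetic concrete, here is a small self-contained sketch. The formula below mirrors FileInputFormat's computeSplitSize (Math.max(minSize, Math.min(maxSize, blockSize))); the 300 MB file and 128 MB block size are just illustrative numbers:
- // Sketch: how the split size and split boundaries fall out for one file.
- public class SplitSizeDemo {
-   static final double SPLIT_SLOP = 1.1; // same slack factor used by FileInputFormat
-   static long computeSplitSize(long blockSize, long minSize, long maxSize) {
-     return Math.max(minSize, Math.min(maxSize, blockSize));
-   }
-   public static void main(String[] args) {
-     long blockSize = 128L << 20;          // 128 MB HDFS block
-     long minSize = 1, maxSize = Long.MAX_VALUE;
-     long length = 300L << 20;             // a hypothetical 300 MB input file
-     long splitSize = computeSplitSize(blockSize, minSize, maxSize); // -> 128 MB
-     long remaining = length;
-     while ((double) remaining / splitSize > SPLIT_SLOP) {
-       System.out.println("split: offset=" + (length - remaining) + " size=" + splitSize);
-       remaining -= splitSize;
-     }
-     if (remaining != 0) {                 // the last split keeps the tail (44 MB here)
-       System.out.println("split: offset=" + (length - remaining) + " size=" + remaining);
-     }
-   }
- }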
二、Reading the key/value pairs
1. Instantiate the InputFormat and initialize the RecordReader
In MapTask's runNewMapper method, the InputFormat and the RecordReader are created and initialized, and then the mapper is run.
MapTask$NewTrackingRecordReader wraps a RecordReader; it is a proxy class around the real reader.
- private <INKEY,INVALUE,OUTKEY,OUTVALUE>
- void runNewMapper(...) { // parameters elided
- // instantiate the user-configured InputFormat via reflection
- org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
- (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
- ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
- .....
- // create the RecordReader, wrapped in the NewTrackingRecordReader proxy
- org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
- new NewTrackingRecordReader<INKEY,INVALUE>
- (split, inputFormat, reporter, job, taskContext);
- .....
- // initialize the RecordReader
- input.initialize(split, mapperContext);
- .....
- // run the mapper
- mapper.run(mapperContext);
- }
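The input variable above is the tracking proxy around whatever createRecordReader() returned. For reference, a minimal RecordReader skeleton showing the methods that runNewMapper and Mapper.run end up calling; this dummy reader is purely illustrative and not part of Hadoop:
- import org.apache.hadoop.io.LongWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapreduce.InputSplit;
- import org.apache.hadoop.mapreduce.RecordReader;
- import org.apache.hadoop.mapreduce.TaskAttemptContext;
- // Dummy reader: emits the numbers 0..4 as keys with an empty Text value.
- public class DummyRecordReader extends RecordReader<LongWritable, Text> {
-   private long current = -1;
-   private final LongWritable key = new LongWritable();
-   private final Text value = new Text("");
-   @Override
-   public void initialize(InputSplit split, TaskAttemptContext context) { } // called from runNewMapper
-   @Override
-   public boolean nextKeyValue() {          // driven by Mapper.run() via the context
-     if (++current >= 5) return false;
-     key.set(current);
-     return true;
-   }
-   @Override
-   public LongWritable getCurrentKey() { return key; }
-   @Override
-   public Text getCurrentValue() { return value; }
-   @Override
-   public float getProgress() { return current < 0 ? 0f : Math.min(1f, current / 5f); }
-   @Override
-   public void close() { }                  // called by MapTask after mapper.run() returns
- }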
2. While the mapper runs, it asks the context for each key and value; the context delegates to the reader through the proxy class MapTask$NewTrackingRecordReader, which also increments the input counters and pushes progress updates
Mapper code:
- public void run(Context context) throws IOException, InterruptedException {
- setup(context);
- while (context.nextKeyValue()) {
- map(context.getCurrentKey(), context.getCurrentValue(), context);
- }
- cleanup(context);
- }
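Each iteration of the while loop above hands exactly one key/value pair from the RecordReader to map(). As a concrete example, a minimal word-count-style Mapper, assuming the usual TextInputFormat key/value types (byte offset and line of text):
- import java.io.IOException;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.LongWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapreduce.Mapper;
- // key = byte offset of the line, value = the line itself (TextInputFormat's contract)
- public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
-   private static final IntWritable ONE = new IntWritable(1);
-   private final Text word = new Text();
-   @Override
-   protected void map(LongWritable key, Text value, Context context)
-       throws IOException, InterruptedException {
-     for (String token : value.toString().split("\\s+")) { // one call to map() per nextKeyValue()
-       if (token.isEmpty()) continue;
-       word.set(token);
-       context.write(word, ONE);
-     }
-   }
- }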
MapContext code:
- @Override
- public boolean nextKeyValue() throws IOException, InterruptedException {
- return reader.nextKeyValue();
- }
- @Override
- public KEYIN getCurrentKey() throws IOException, InterruptedException {
- return reader.getCurrentKey();
- }
- @Override
- public VALUEIN getCurrentValue() throws IOException, InterruptedException {
- return reader.getCurrentValue();
- }
MapTask$NewTrackingRecordReader code:
- @Override
- public boolean nextKeyValue() throws IOException, InterruptedException {
- boolean result = false;
- try {
- long bytesInPrev = getInputBytes(fsStats);
- result = real.nextKeyValue(); // the real RecordReader reads the next record
- long bytesInCurr = getInputBytes(fsStats);
- if (result) {
- inputRecordCounter.increment(1); // count the record that was just read
- fileInputByteCounter.increment(bytesInCurr - bytesInPrev); // count the bytes that were read
- }
- reporter.setProgress(getProgress()); // sets the reporter's progress flag so the update gets pushed
- } catch (IOException ioe) {
- if (inputSplit instanceof FileSplit) {
- FileSplit fileSplit = (FileSplit) inputSplit;
- LOG.error("IO error in map input file "
- + fileSplit.getPath().toString());
- throw new IOException("IO error in map input file "
- + fileSplit.getPath().toString(), ioe);
- }
- throw ioe;
- }
- return result;
- }
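The counters incremented here surface as the standard map-input counters, so the effect of this proxy is visible from the driver once the job finishes. A hedged sketch, assuming the TaskCounter enum from org.apache.hadoop.mapreduce available in later releases (older versions expose the same counter under a different enum name):
- // After the job completes, read back the record counter that
- // NewTrackingRecordReader.nextKeyValue() was incrementing.
- job.waitForCompletion(true);
- long mapInputRecords = job.getCounters()
-     .findCounter(TaskCounter.MAP_INPUT_RECORDS)
-     .getValue();
- System.out.println("map input records: " + mapInputRecords);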
3. After the mapper method finishes, control returns to the MapTask, which closes the reader
- mapper.run(mapperContext);
- input.close(); // close the RecordReader
- output.close(mapperContext);
Note that the two phases are not carried out in the same thread: after the splits are generated, the client moves into the job-monitoring stage, while the actual key/value reading happens later inside each MapTask.
Taken together, the flow above exercises every abstract method of the InputFormat base class.
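For reference, these are the two abstract methods every InputFormat must provide, matching the two phases above: getSplits() during job submission and createRecordReader() inside the MapTask (signatures from the org.apache.hadoop.mapreduce API):
- // From the org.apache.hadoop.mapreduce package:
- public abstract class InputFormat<K, V> {
-   // phase one: called on the client side by writeNewSplits()
-   public abstract List<InputSplit> getSplits(JobContext context)
-       throws IOException, InterruptedException;
-   // phase two: called in runNewMapper(); the returned reader is wrapped by NewTrackingRecordReader
-   public abstract RecordReader<K, V> createRecordReader(InputSplit split,
-       TaskAttemptContext context) throws IOException, InterruptedException;
- }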