1. MapReduce job parallelism
1.1. What determines the number of reduce tasks
1. The needs of the business logic
2. The volume of data
How to set it:
job.setNumReduceTasks(5)
1.2. What determines the number of map tasks:
Map tasks do not cooperate with each other; each one works independently, so no "global" aggregation can be done inside a map task. The number of map tasks is therefore determined entirely by the size of the data to be processed.
The mechanism:
the input data is divided into "splits";
each split is assigned to one map task.
1.2.1. The default split mechanism of the MapReduce framework:
TextInputFormat.getSplits() is inherited from FileInputFormat.getSplits()
1: Define a split size: it can be adjusted via parameters; by default it equals the blocksize configured in HDFS, usually 128 MB.
2: Get the list of all files to be processed under the input directory.
3: Iterate over the file list and split each file in turn.
for (file : List): cut file starting from offset 0, producing one split every 128 MB. For example a.txt (200 MB) is cut into two splits: a.txt: 0-128M and a.txt: 128M-200M;
b.txt (80 MB), on the other hand, yields a single split: b.txt: 0-80M.
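A minimal sketch of how the effective split size is derived; this mirrors the computeSplitSize() logic in FileInputFormat, and the concrete values below are only illustrative assumptions:
// effective split size: splitSize = max(minSize, min(maxSize, blockSize))
// minSize comes from mapreduce.input.fileinputformat.split.minsize (default 1)
// maxSize comes from mapreduce.input.fileinputformat.split.maxsize (default Long.MAX_VALUE)
long blockSize = 128L * 1024 * 1024;   // 128 MB HDFS block
long minSize = 1L;                     // illustrative default
long maxSize = Long.MAX_VALUE;         // illustrative default
long splitSize = Math.max(minSize, Math.min(maxSize, blockSize)); // 128 MB with the defaults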
- If the input consists of a large number of small files, this default split mechanism produces a huge number of splits and therefore a huge number of map task processes, yet each split is tiny and each map task processes only a little data, so the overall efficiency is very low.
- The general solution is to pack multiple small files into one split; this is done by subclassing InputFormat and overriding its getSplits method.
- The MapReduce framework already ships with an InputFormat implementation for exactly this scenario: CombineFileInputFormat.
1.2.3. Data splits and the number of map tasks
Examples to observe (many files, large files)
Source-code walkthrough
Reading the TextInputFormat source:
isSplitable()  decides whether the input data may be split
getSplits()    plans the split metadata (implemented in FileInputFormat)
---- TextInputFormat split logic: each file is split independently; the split size defaults to the blocksize,
but two parameters can adjust it: mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.
With a large number of small files this split logic has a serious drawback: far too many splits and therefore far too many map tasks.
createRecordReader()  constructs a record reader.
The actual reading logic is implemented in LineRecordReader (it reads the data line by line, using the line's starting byte offset as the key and the line's content as the value). What is special is that:
when reading a split, LineRecordReader always skips the first line (for every split except the first one) and always reads one extra line across the split boundary (for every split except the last one).
1.3. The InputFormat class hierarchy
1.3.1. InputFormat subclasses:
(1) TextInputFormat (the default input format) in detail
-- Source structure: getSplits(), the record reader
-- Why a line is never cut in half and processed by two map tasks:
- In LineRecordReader, the first line of a split is skipped
public void initialize(InputSplit genericSplit,
                       TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    ... ...
    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(file);
    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    if (null != codec) {
        ... ...
    }
    // We always discard the first record (except in the first split of the file),
    // because nextKeyValue() always reads one extra line across the split boundary
    // (except in the last split of the file).
    if (start != 0) {
        start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
    this.pos = start;
}
- In LineRecordReader, nextKeyValue() always reads one extra line across the split boundary
public boolean nextKeyValue() throws IOException {
    if (key == null) {
        key = new LongWritable();
    }
    key.set(pos);
    if (value == null) {
        value = new Text();
    }
    int newSize = 0;
    // the <= comparison (together with needAdditionalRecordAfterSplit) makes the reader consume one extra line past the end of the split
    while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
        newSize = in.readLine(value, maxLineLength,
                Math.max(maxBytesToConsume(pos), maxLineLength));
        pos += newSize;
        if (newSize < maxLineLength) {
            break;
        }
        ... ...
}
(2) CombineTextInputFormat
Its split logic is completely different from TextInputFormat's:
CombineTextInputFormat can pack multiple small files into a single split.
This mechanism improves efficiency when processing huge numbers of small files.
(For small files, the best strategy is still to merge them first and then process the merged result.)
Approach
CombineFileInputFormat involves three important properties:
- mapred.max.split.size: the maximum size of a split formed from blocks of the same node or the same rack;
- mapred.min.split.size.per.node: the minimum size of a split formed from blocks of the same node;
- mapred.min.split.size.per.rack: the minimum size of a split formed from blocks of the same rack.
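A minimal driver sketch of how CombineTextInputFormat is typically plugged in; the 4 MB limit and the input path are illustrative assumptions, and depending on the Hadoop version the size limit is carried by mapred.max.split.size or mapreduce.input.fileinputformat.split.maxsize:
// in the job driver
job.setInputFormatClass(CombineTextInputFormat.class);
// upper bound for a combined split (4 MB here, purely illustrative)
CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);
FileInputFormat.addInputPath(job, new Path("/data/smallfiles"));  // illustrative input path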
1.3.2. How the splits are formed:
(1) Form splits node by node (from each node's blocks):
a. Iterate over and accumulate the blocks on the node; whenever the accumulated size is greater than or equal to mapred.max.split.size, turn those blocks into one split; repeat until the remaining accumulated size is below mapred.max.split.size, then go to the next step;
b. If the remaining blocks' accumulated size is greater than or equal to mapred.min.split.size.per.node, turn them into one split; if it is smaller, leave these blocks for later processing.
(2) Form splits rack by rack (from each rack's blocks):
a. Iterate over and accumulate the blocks on the rack (these are the blocks left over from the previous step); whenever the accumulated size is greater than or equal to mapred.max.split.size, turn those blocks into one split; repeat until the remaining accumulated size is below mapred.max.split.size, then go to the next step;
b. If the remaining blocks' accumulated size is greater than or equal to mapred.min.split.size.per.rack, turn them into one split; if it is smaller, leave these blocks for later processing.
(3) Iterate over and accumulate the blocks that are still left; whenever the accumulated size is greater than or equal to mapred.max.split.size, turn those blocks into one split; repeat until the remaining accumulated size is below mapred.max.split.size, then go to the next step;
(4) The remaining blocks form one final split.
1.3.3. Core implementation
// mapping from a rack name to the list of blocks it has
HashMap<String,List<OneBlockInfo>> rackToBlocks =
new HashMap<String,List<OneBlockInfo>>();
// mapping from a block to the nodes on which it has replicas
HashMap<OneBlockInfo,String[]> blockToNodes =
new HashMap<OneBlockInfo,String[]>();
// mapping from a node to the list of blocks that it contains
HashMap<String,List<OneBlockInfo>> nodeToBlocks =
new HashMap<String,List<OneBlockInfo>>();
Before split formation starts, three important mappings have to be initialized:
- rackToBlocks: the rack-to-blocks mapping, i.e. which blocks sit on a given rack;
- blockToNodes: the block-to-nodes mapping, i.e. which nodes hold replicas ("copies") of a given block;
- nodeToBlocks: the node-to-blocks mapping, i.e. which blocks sit on a given node.
The initialization is shown in the code below: each Path (file) is turned into a OneFileInfo object, and the mappings are maintained while each OneFileInfo is being built.
// populate all the blocks for all files
long totLength = 0;
for (int i = 0; i < paths.length; i++) {
    files[i] = new OneFileInfo(paths[i], job,
                               rackToBlocks, blockToNodes, nodeToBlocks, rackToNodes);
    totLength += files[i].getLength();
}
(1) Form splits node by node, as in the following code:
// blocks contained in the split currently being built
ArrayList<OneBlockInfo> validBlocks = new ArrayList<OneBlockInfo>();
// nodes that the blocks of the current split belong to
ArrayList<String> nodes = new ArrayList<String>();
// size of the current split
long curSplitSize = 0;

// process all nodes and create splits that are local to a node;
// the blocks of each node are handled in turn
for (Iterator<Map.Entry<String, List<OneBlockInfo>>> iter = nodeToBlocks.entrySet().iterator(); iter.hasNext();) {
    Map.Entry<String, List<OneBlockInfo>> one = iter.next();
    nodes.add(one.getKey());
    List<OneBlockInfo> blocksInNode = one.getValue();

    // for each block, copy it into validBlocks. Delete it from blockToNodes
    // so that the same block does not appear in two different splits;
    // this is exactly what guarantees that one block never ends up in two splits.
    for (OneBlockInfo oneblock : blocksInNode) {
        if (blockToNodes.containsKey(oneblock)) {
            validBlocks.add(oneblock);
            blockToNodes.remove(oneblock);
            curSplitSize += oneblock.length;

            // if the accumulated split size reaches the maximum (maxSize), then create this split
            if (maxSize != 0 && curSplitSize >= maxSize) {
                // create an input split and add it to the splits array
                addCreatedSplit(job, splits, nodes, validBlocks);
                curSplitSize = 0;
                validBlocks.clear();
            }
        }
    }

    // if there were any blocks left over and their combined size is
    // larger than minSplitNode, then combine them into one split.
    // Otherwise add them back to the unprocessed pool; it is likely
    // that they will be combined with other blocks from the same rack later on.
    // In other words: leftover blocks of at least minSizeNode form one split,
    // smaller leftovers are returned to blockToNodes for the later "same rack" pass.
    if (minSizeNode != 0 && curSplitSize >= minSizeNode) {
        // create an input split and add it to the splits array
        addCreatedSplit(job, splits, nodes, validBlocks);
    } else {
        for (OneBlockInfo oneblock : validBlocks) {
            blockToNodes.put(oneblock, oneblock.hosts);
        }
    }
    validBlocks.clear();
    nodes.clear();
    curSplitSize = 0;
}
(2) Form splits rack by rack, as in the following code:
// if blocks in a rack are below the specified minimum size, then keep them
// in 'overflow'. After the processing of all racks is complete, these overflow
// blocks will be combined into splits.
// overflowBlocks holds the blocks that are left over after this "same rack" pass
ArrayList<OneBlockInfo> overflowBlocks = new ArrayList<OneBlockInfo>();
ArrayList<String> racks = new ArrayList<String>();

// Process all racks over and over again until there is no more work to do.
while (blockToNodes.size() > 0) {
    // Create one split for this rack before moving over to the next rack.
    // Come back to this rack after creating a single split for each of the
    // remaining racks.
    // Process one rack location at a time, combine all possible blocks that
    // reside on this rack as one split (constrained by minimum and maximum
    // split size).

    // iterate over all racks, handling each rack in turn
    for (Iterator<Map.Entry<String, List<OneBlockInfo>>> iter =
             rackToBlocks.entrySet().iterator(); iter.hasNext();) {
        Map.Entry<String, List<OneBlockInfo>> one = iter.next();
        racks.add(one.getKey());
        List<OneBlockInfo> blocks = one.getValue();

        // for each block, copy it into validBlocks. Delete it from
        // blockToNodes so that the same block does not appear in
        // two different splits.
        boolean createdSplit = false;
        // handle each block of this rack in turn
        for (OneBlockInfo oneblock : blocks) {
            if (blockToNodes.containsKey(oneblock)) {
                validBlocks.add(oneblock);
                blockToNodes.remove(oneblock);
                curSplitSize += oneblock.length;

                // if the accumulated split size reaches the maximum (maxSize), then create this split
                if (maxSize != 0 && curSplitSize >= maxSize) {
                    // create an input split and add it to the splits array
                    addCreatedSplit(job, splits, getHosts(racks), validBlocks);
                    createdSplit = true;
                    break;
                }
            }
        }

        // if we created a split, then just go to the next rack
        if (createdSplit) {
            curSplitSize = 0;
            validBlocks.clear();
            racks.clear();
            continue;
        }

        if (!validBlocks.isEmpty()) {
            // if the leftover blocks reach minSizeRack, they form one split
            if (minSizeRack != 0 && curSplitSize >= minSizeRack) {
                // if there is a minimum size specified, then create a single split,
                // otherwise store these blocks into the overflow data structure
                addCreatedSplit(job, splits, getHosts(racks), validBlocks);
            } else {
                // There were a few blocks in this rack that remained to be processed.
                // Keep them in the 'overflow' block list. These will be combined later.
                // (leftover blocks smaller than minSizeRack go into overflowBlocks)
                overflowBlocks.addAll(validBlocks);
            }
        }
        curSplitSize = 0;
        validBlocks.clear();
        racks.clear();
    }
}
(3) Iterate over and accumulate the remaining (overflow) blocks, as in the following code:
// Process all overflow blocks
for (OneBlockInfo oneblock : overflowBlocks) {
    validBlocks.add(oneblock);
    curSplitSize += oneblock.length;

    // This might cause an existing rack location to be re-added,
    // but it should be ok.
    for (int i = 0; i < oneblock.racks.length; i++) {
        racks.add(oneblock.racks[i]);
    }

    // if the accumulated split size reaches the maximum (maxSize), then
    // create this split.
    if (maxSize != 0 && curSplitSize >= maxSize) {
        // create an input split and add it to the splits array
        addCreatedSplit(job, splits, getHosts(racks), validBlocks);
        curSplitSize = 0;
        validBlocks.clear();
        racks.clear();
    }
}
(4) The remaining blocks form one last split, as in the following code:
// Process any remaining blocks, if any.
if (!validBlocks.isEmpty()) {
    addCreatedSplit(job, splits, getHosts(racks), validBlocks);
}
1.3.4. Summary
While forming splits, CombineFileInputFormat takes data locality into account (same node, then same rack): it first combines blocks on the same node, then blocks on the same rack, and finally the leftover blocks, so the locality it can guarantee weakens step by step. Also note that CombineFileInputFormat is abstract; to use it directly you have to implement its record-reader factory method (getRecordReader in the old API, createRecordReader in the new one) yourself.
SequenceFileInputFormat / SequenceFileOutputFormat:
SequenceFile is a very important data format in Hadoop;
inside a SequenceFile the data is organized as key-value pairs.
These two classes read input from / write output to Hadoop sequence files.
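A minimal driver sketch of a job whose output is written as a SequenceFile (the key/value classes and the output path are illustrative assumptions); a downstream job can then read it back simply by setting SequenceFileInputFormat as its input format:
// in the job driver: write the (key, value) output as a SequenceFile
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(Text.class);            // illustrative key type
job.setOutputValueClass(LongWritable.class);  // illustrative value type
FileOutputFormat.setOutputPath(job, new Path("/out/seq"));  // illustrative path
// downstream job:
// job2.setInputFormatClass(SequenceFileInputFormat.class);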
2. MultipleInputs
Although FileInputFormat can read multiple directories, sometimes the data to be processed comes from different sources, or version upgrades have introduced format differences: for example, some files are tab-separated while others are comma-separated. In such cases MultipleInputs can be used to assign a different mapper class to each input path.
Example scenario: a data-analysis system has to analyze two kinds of files, one in plain text format and one in SequenceFile format.
2.1. Implementation steps:
2.1.1. Write two mapper classes with different logic, one per file format
static class TextMapperA extends Mapper<LongWritable, Text, Text, LongWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] words = line.split(" ");
for (String w : words) {
context.write(new Text(w), new LongWritable(1));
}
}
}
static class SequenceMapperB extends Mapper<Text, LongWritable, Text, LongWritable> {
@Override
protected void map(Text key, LongWritable value, Context context) throws IOException, InterruptedException {
context.write(key, value);
}
}
2.1.2. In the job driver, use MultipleInputs to assign a different mapper class and InputFormat to each kind of input
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(WordCount.class);
// assign a different mapper class and InputFormat class to each input path
MultipleInputs.addInputPath(job, new Path("c:/wordcount/textdata"),TextInputFormat.class, TextMapperA.class);
MultipleInputs.addInputPath(job, new Path("c:/wordcount/seqdata"),SequenceFileInputFormat.class, SequenceMapperB.class);
job.setReducerClass(SameReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
FileOutputFormat.setOutputPath(job, new Path("c:/wordcount/multiinouts"));
job.waitForCompletion(true);
}
3. Custom InputFormat
When the built-in TextInputFormat and SequenceFileInputFormat do not meet the requirements, a custom InputFormat can be written to read the files.
Example scenario: merge small files, each as a whole, into a large SequenceFile (common in production: MR is good at handling large files, but many production systems generate huge numbers of small files, and analyzing them directly with Hadoop is inefficient; a SequenceFile is a convenient way to merge the small files into a large one and thus improve processing efficiency).
The usual approach: use each small file's name as the key and its entire content as the value, and write these pairs into one big SequenceFile.
3.1. Code:
3.1.1. Define a custom InputFormat
static class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
@Override
//override the parent's logic and always return false, so that every file is treated as a single, unsplittable split
protected boolean isSplitable(JobContext context, Path filename) {
return false;
}
@Override
public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
//return a custom RecordReader that does the actual reading
WholeFileRecordReader reader = new WholeFileRecordReader();
reader.initialize(split, context);
return reader;
}
}
3.1.2. Implement the custom WholeFileRecordReader
class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
private FileSplit fileSplit;
private Configuration conf;
//a bytes buffer that caches the entire content of one small file
private BytesWritable value = new BytesWritable();
private boolean processed = false;
//initialization: store the incoming file split and context objects in the class members
@Override
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
fileSplit = (FileSplit) split;
conf = context.getConfiguration();
}
@Override
//core logic: read the source data and package it as a KEY / VALUE pair
public boolean nextKeyValue() throws IOException, InterruptedException {
//processed is true once the current small file has been handled
if (!processed) {
byte[] contents = new byte[(int) fileSplit.getLength()];
Path filePath = fileSplit.getPath();
FileSystem fs = filePath.getFileSystem(conf);
FSDataInputStream in = fs.open(filePath);
IOUtils.readFully(in, contents, 0, contents.length);
value.set(contents, 0, contents.length);
IOUtils.closeStream(in);
processed = true;
return true;
}
//if the current small file has already been processed, return false so the caller moves on to the next file split
return false;
}
@Override
//return the current key
public NullWritable getCurrentKey() throws IOException, InterruptedException {
return NullWritable.get();
}
@Override
//return the current value
public BytesWritable getCurrentValue() throws IOException, InterruptedException {
return value;
}
@Override
//report progress: 1 once the small file has been read, 0 otherwise
public float getProgress() throws IOException, InterruptedException {
return processed ? 1.0f : 0.0f;
}
@Override
public void close() throws IOException {
}
}
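To complete the scenario, here is a minimal sketch of the mapper and driver that use WholeFileInputFormat to pack the small files into one SequenceFile; the enclosing class name SmallFilesToSequenceFile, the use of the file path as key, and the input/output paths are assumptions for illustration:
static class SequenceFileMapper extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
    private Text filenameKey;
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // use the path of the file backing this split as the output key
        InputSplit split = context.getInputSplit();
        Path path = ((FileSplit) split).getPath();
        filenameKey = new Text(path.toString());
    }
    @Override
    protected void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // one record per small file: (file name, whole file content)
        context.write(filenameKey, value);
    }
}
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    job.setJarByClass(SmallFilesToSequenceFile.class);   // hypothetical enclosing class
    job.setInputFormatClass(WholeFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapperClass(SequenceFileMapper.class);
    job.setNumReduceTasks(0);                             // map-only merge job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);
    FileInputFormat.addInputPath(job, new Path("c:/smallfiles"));            // illustrative path
    FileOutputFormat.setOutputPath(job, new Path("c:/smallfiles-merged"));   // illustrative path
    job.waitForCompletion(true);
}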
4. MapReduce output format components
4.1. TextOutputFormat source-code structure:
4.2. MultipleOutputs
static class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
MultipleOutputs<Text, LongWritable> multipleOutputs = null;
//construct a MultipleOutputs instance in the setup method
protected void setup(Context context) throws IOException, InterruptedException {
multipleOutputs = new MultipleOutputs<Text, LongWritable>(context);
}
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
long count = 0;
for (LongWritable value : values) {
count += value.get();
}
if (key.toString().startsWith("a")) {
//different content can be written to different files based on a condition
multipleOutputs.write(key, new LongWritable(count), "c:/multi/outputa/a");
} else {
multipleOutputs.write(key, new LongWritable(count), "c:/sb/outputb/b");
}
}
//multipleOutputs must be closed, otherwise the content is never actually written to the files
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
multipleOutputs.close();
}
}
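In the driver, a default output path still has to be set with FileOutputFormat.setOutputPath; a common companion setting, shown here as a sketch with an illustrative path, is LazyOutputFormat, which suppresses the empty default part-files when everything is written through MultipleOutputs:
// in the job driver
job.setReducerClass(CountReducer.class);
// create the default output files lazily, i.e. only if something is actually written to them
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("c:/multi/defaultout"));  // illustrative path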
4.3. Custom FileOutputFormat
public class FlowOutputFormat extends FileOutputFormat<Text, NullWritable> {
@Override
public RecordWriter<Text, NullWritable> getRecordWriter(
TaskAttemptContext context) throws IOException,
InterruptedException {
FileSystem fs = FileSystem.get(context.getConfiguration());
Path enhancelog = new Path("hdfs://weekend01:9000/enhance/enhanced.log");
Path tocrawl = new Path("hdfs://weekend01:9000/enhance/tocrawl.log");
//construct two different output streams
FSDataOutputStream enhanceOs = fs.create(enhancelog);
FSDataOutputStream tocrawlOs = fs.create(tocrawl);
//pass the two streams to FlowRecordWriter through its constructor
return new FlowRecordWriter(enhanceOs, tocrawlOs);
}
public static class FlowRecordWriter extends
RecordWriter<Text, NullWritable> {
private FSDataOutputStream enhanceOs;
private FSDataOutputStream tocrawlOs;
public FlowRecordWriter(FSDataOutputStream enhanceOs,
FSDataOutputStream tocrawlOs) {
this.enhanceOs = enhanceOs;
this.tocrawlOs = tocrawlOs;
}
//the actual writing is done in the write method
@Override
public void write(Text key, NullWritable value) throws IOException,
InterruptedException {
String line = key.toString();
if (line.contains("itisok")) {
enhanceOs.write(line.getBytes());
} else {
tocrawlOs.write(line.getBytes());
}
}
@Override
public void close(TaskAttemptContext context) throws IOException,
InterruptedException {
if (enhanceOs != null) {
enhanceOs.close();
}
if (tocrawlOs != null) {
tocrawlOs.close();
}
}
}
}
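A minimal sketch of how this custom output format would be plugged into the driver; note that FileOutputFormat still requires an output directory for its committer even though the record writer above writes to its own fixed paths (the directory below is an illustrative assumption):
// in the job driver
job.setOutputFormatClass(FlowOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("hdfs://weekend01:9000/enhance/out"));  // illustrative committer directory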
5. The Configuration object and ToolRunner
5.1. Priority of configuration parameters:
*-site.xml on the cluster < configuration files under src (on the client classpath) < conf.set()
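A small sketch illustrating this priority order (the extra resource name is an illustrative assumption; a value set programmatically wins over anything loaded from resource files):
Configuration conf = new Configuration();                // loads *-default.xml and *-site.xml from the classpath
conf.addResource("my-extra-site.xml");                   // illustrative extra resource on the classpath
conf.set("mapreduce.job.reduces", "3");                  // programmatic value, highest priority
System.out.println(conf.get("mapreduce.job.reduces"));   // prints 3, regardless of what the xml files say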
5.2. ToolRunner: configuration parameters or files can be set dynamically on the submit command line
The Configuration object can also be used to distribute small amounts of data to all task nodes.
Example:
/**
Parameters can be passed to the conf object by adding options at run time:
-D property=value
-conf filename ...
-fs uri            equivalent to -D fs.defaultFS=uri
-jt host:port      equivalent to -D yarn.resourcemanager.address=host:port
-files file1,file2,
-archives archive1,archive2
-libjars jar1,jar2,...
*
* @author duanhaitao@itcast.cn
*
*/
public class TestToolrunner extends Configured implements Tool {
static {
Configuration.addDefaultResource("hdfs-default.xml");
Configuration.addDefaultResource("hdfs-site.xml");
Configuration.addDefaultResource("core-default.xml");
Configuration.addDefaultResource("core-site.xml");
Configuration.addDefaultResource("mapred-default.xml");
Configuration.addDefaultResource("mapred-site.xml");
Configuration.addDefaultResource("yarn-default.xml");
Configuration.addDefaultResource("yarn-site.xml");
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
TreeMap<String, String> treeMap = new TreeMap<String,String>();
for (Entry<String, String> ent : conf) {
treeMap.put(ent.getKey(), ent.getValue());
}
for (Entry<String, String> ent : treeMap.entrySet()) {
System.out.printf("%s=%s\n", ent.getKey(), ent.getValue());
}
return 0;
}
public static void main(String[] args) throws Exception {
ToolRunner.run(new TestToolrunner(), args);
}
}
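A hedged usage example (the jar and file names are illustrative): when the program is submitted through ToolRunner, GenericOptionsParser consumes the generic options before the application's own arguments, e.g.
hadoop jar mymr.jar TestToolrunner -D mapreduce.job.reduces=3 -conf extra-site.xml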
6. MapReduce data compression
For computation-intensive jobs, use compression sparingly.
For IO-intensive jobs, use compression more.
Compressing the mapper or reducer output with a compression codec reduces disk IO and speeds up the MR program (at the cost of extra CPU load).
6.1. Compression codecs supported by MR
6.2. Compressing the reducer output
---- configuration properties
mapreduce.output.fileoutputformat.compress=false
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
mapreduce.output.fileoutputformat.compress.type=RECORD
Or set it in code:
Job job = Job.getInstance(conf);
FileOutputFormat.setCompressOutput(job, true);
// the codec class name is left out here; e.g. org.apache.hadoop.io.compress.GzipCodec could be used
FileOutputFormat.setOutputCompressorClass(job, (Class<? extends CompressionCodec>) Class.forName(""));
6.3. Compressing the mapper output
---- configuration properties
mapreduce.map.output.compress=false
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
Or in code:
conf.setBoolean(Job.MAP_OUTPUT_COMPRESS, true);
conf.setClass(Job.MAP_OUTPUT_COMPRESS_CODEC, GzipCodec.class, CompressionCodec.class);
6.4. Reading compressed files
Hadoop's built-in InputFormat classes support reading compressed files out of the box. Take TextInputFormat as an example: in the initialize method of its record reader (LineRecordReader):
public void initialize(InputSplit genericSplit,
TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
// open the file and seek to the start of the split
final FileSystem fs = file.getFileSystem(job);
fileIn = fs.open(file);
//create the appropriate compression codec based on the file name suffix
CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
if (null!=codec) {
isCompressedInput = true;
decompressor = CodecPool.getDecompressor(codec);
//check whether this codec is a splittable compression codec
if (codec instanceof SplittableCompressionCodec) {
final SplitCompressionInputStream cIn =
((SplittableCompressionCodec)codec).createInputStream(
fileIn, decompressor, start, end,
SplittableCompressionCodec.READ_MODE.BYBLOCK);
//for a splittable codec, create a CompressedSplitLineReader to read the compressed data
in = new CompressedSplitLineReader(cIn, job,
this.recordDelimiterBytes);
start = cIn.getAdjustedStart();
end = cIn.getAdjustedEnd();
filePosition = cIn;
} else {
//for a non-splittable codec, wrap the file input stream in a decompressing stream and hand it to an ordinary SplitLineReader
in = new SplitLineReader(codec.createInputStream(fileIn,
decompressor), job, this.recordDelimiterBytes);
filePosition = fileIn;
}
} else {
fileIn.seek(start);
//for an uncompressed file, create an ordinary SplitLineReader to read the data
in = new SplitLineReader(fileIn, job, this.recordDelimiterBytes);
filePosition = fileIn;
}
7. MapReduce counters
7.1. Built-in counters of the MapReduce framework:
- Task group
- Inputformat group
- Outputformat group
- Framework group
7.2. User-defined counters
- defined via an enum
- named dynamically (by group and counter name)
public class MultiOutputs {
//define a custom counter via an enum
enum MyCounter{MALFORORMED,NORMAL}
static class CommaMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] words = value.toString().split(",");
for (String word : words) {
context.write(new Text(word), new LongWritable(1));
}
//increment the enum-defined custom counter by 1
context.getCounter(MyCounter.MALFORORMED).increment(1);
//increment a dynamically named custom counter by 1
context.getCounter("counterGroupa", "countera").increment(1);
}
}
}
How counters work, in brief: they are maintained by the ApplicationMaster and are global (aggregated across all tasks).
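Because the counters are aggregated globally, their final values can be read back in the driver once the job has finished; a minimal sketch (assuming the MultiOutputs job above):
job.waitForCompletion(true);
// counters are aggregated by the ApplicationMaster across all tasks
Counters counters = job.getCounters();
long malformed = counters.findCounter(MultiOutputs.MyCounter.MALFORORMED).getValue();
long dynamic = counters.findCounter("counterGroupa", "countera").getValue();
System.out.println("malformed: " + malformed + ", countera: " + dynamic);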
8. MapReduce log analysis
Where the logs are stored:
the logs of the system service daemons are stored by default in the logs directory under the Hadoop installation directory.
9. Chaining multiple jobs
Use the JobControl class to establish ordering and dependency relations between jobs;
example:
ControlledJob cJob1 = new ControlledJob(job1.getConfiguration());
ControlledJob cJob2 = new ControlledJob(job2.getConfiguration());
ControlledJob cJob3 = new ControlledJob(job3.getConfiguration());
// set up the dependencies between the jobs
cJob2.addDependingJob(cJob1);
cJob3.addDependingJob(cJob2);
JobControl jobControl = new JobControl("RecommendationJob");
jobControl.addJob(cJob1);
jobControl.addJob(cJob2);
jobControl.addJob(cJob3);
cJob1.setJob(job1);
cJob2.setJob(job2);
cJob3.setJob(job3);
// start a new thread to run the jobs that were added to JobControl, then wait until they all finish
Thread jobControlThread = new Thread(jobControl);
jobControlThread.start();
while (!jobControl.allFinished()) {
Thread.sleep(500);
}
jobControl.stop();
return 0;