1. MapReduce job parallelism
1.1. What determines the number of reduce tasks
1. The needs of the business logic
2. The volume of data
How to set it:
job.setNumReduceTasks(5)
1.2. What determines the number of map tasks:
Map tasks do not cooperate with each other; each one works independently, so no "global" aggregation can be done inside a map task. The number of map tasks is therefore determined entirely by the size of the data to be processed.
The mechanism:
the input data is divided into "splits";
each split is assigned to one map task.
1.2.1. The default split mechanism of the MapReduce framework:
TextInputFormat.getSplits() is inherited from FileInputFormat.getSplits()
1: Define a split size: it can be adjusted via parameters; by default it equals the blocksize configured in HDFS, usually 128 MB.
2: Get the list of all files to be processed under the input directory.
3: Iterate over the file list and split each file in turn.
for (file : List): cut file starting from offset 0, producing one split every 128 MB. For example a.txt (200 MB) is cut into two splits: a.txt: 0-128M and a.txt: 128M-200M;
b.txt (80 MB), on the other hand, yields a single split: b.txt: 0-80M.
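A minimal sketch of how the effective split size is derived; this mirrors the computeSplitSize() logic in FileInputFormat, and the concrete values below are only illustrative assumptions:
// effective split size: splitSize = max(minSize, min(maxSize, blockSize))
// minSize comes from mapreduce.input.fileinputformat.split.minsize (default 1)
// maxSize comes from mapreduce.input.fileinputformat.split.maxsize (default Long.MAX_VALUE)
long blockSize = 128L * 1024 * 1024;   // 128 MB HDFS block
long minSize = 1L;                     // illustrative default
long maxSize = Long.MAX_VALUE;         // illustrative default
long splitSize = Math.max(minSize, Math.min(maxSize, blockSize)); // 128 MB with the defaults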
- If the input consists of a large number of small files, this default split mechanism produces a huge number of splits and therefore a huge number of map task processes, yet each split is tiny and each map task processes only a little data, so the overall efficiency is very low.
- The general solution is to pack multiple small files into one split; this is done by subclassing InputFormat and overriding its getSplits method.
- The MapReduce framework already ships with an InputFormat implementation for exactly this scenario: CombineFileInputFormat.
1.2.3. Data splits and the number of map tasks
Examples to observe (many files, large files)
Source-code walkthrough
Reading the TextInputFormat source:
isSplitable()  decides whether the input data may be split
getSplits()    plans the split metadata (implemented in FileInputFormat)
---- TextInputFormat split logic: each file is split independently; the split size defaults to the blocksize,
but two parameters can adjust it: mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.
With a large number of small files this split logic has a serious drawback: far too many splits and therefore far too many map tasks.
createRecordReader()  constructs a record reader.
The actual reading logic is implemented in LineRecordReader (it reads the data line by line, using the line's starting byte offset as the key and the line's content as the value). What is special is that:
when reading a split, LineRecordReader always skips the first line (for every split except the first one) and always reads one extra line across the split boundary (for every split except the last one).
1.3. The InputFormat class hierarchy
1.3.1. InputFormat subclasses:
(1) TextInputFormat (the default input format) in detail
-- Source structure: getSplits(), the record reader
-- Why a line is never cut in half and processed by two map tasks:
- In LineRecordReader, the first line of a split is skipped
public void initialize(InputSplit genericSplit,
                       TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    ... ...
    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(file);
    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    if (null != codec) {
        ... ...
    }
    // We always discard the first record (except in the first split of the file),
    // because nextKeyValue() always reads one extra line across the split boundary
    // (except in the last split of the file).
    if (start != 0) {
        start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
    this.pos = start;
}
- In LineRecordReader, nextKeyValue() always reads one extra line across the split boundary
public boolean nextKeyValue() throws IOException {
    if (key == null) {
        key = new LongWritable();
    }
    key.set(pos);
    if (value == null) {
        value = new Text();
    }
    int newSize = 0;
    // the <= comparison (together with needAdditionalRecordAfterSplit) makes the reader consume one extra line past the end of the split
    while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
        newSize = in.readLine(value, maxLineLength,
                Math.max(maxBytesToConsume(pos), maxLineLength));
        pos += newSize;
        if (newSize < maxLineLength) {
            break;
        }
        ... ...
}
(2) CombineTextInputFormat
Its split logic is completely different from TextInputFormat's:
CombineTextInputFormat can pack multiple small files into a single split.
This mechanism improves efficiency when processing huge numbers of small files.
(For small files, the best strategy is still to merge them first and then process the merged result.)
Approach
CombineFileInputFormat involves three important properties:
- mapred.max.split.size: the maximum size of a split formed from blocks of the same node or the same rack;
- mapred.min.split.size.per.node: the minimum size of a split formed from blocks of the same node;
- mapred.min.split.size.per.rack: the minimum size of a split formed from blocks of the same rack.
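A minimal driver sketch of how CombineTextInputFormat is typically plugged in; the 4 MB limit and the input path are illustrative assumptions, and depending on the Hadoop version the size limit is carried by mapred.max.split.size or mapreduce.input.fileinputformat.split.maxsize:
// in the job driver
job.setInputFormatClass(CombineTextInputFormat.class);
// upper bound for a combined split (4 MB here, purely illustrative)
CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);
FileInputFormat.addInputPath(job, new Path("/data/smallfiles"));  // illustrative input path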
1.3.2. How the splits are formed:
(1) Form splits node by node (from each node's blocks):
a. Iterate over and accumulate the blocks on the node; whenever the accumulated size is greater than or equal to mapred.max.split.size, turn those blocks into one split; repeat until the remaining accumulated size is below mapred.max.split.size, then go to the next step;
b. If the remaining blocks' accumulated size is greater than or equal to mapred.min.split.size.per.node, turn them into one split; if it is smaller, leave these blocks for later processing.
(2) Form splits rack by rack (from each rack's blocks):
a. Iterate over and accumulate the blocks on the rack (these are the blocks left over from the previous step); whenever the accumulated size is greater than or equal to mapred.max.split.size, turn those blocks into one split; repeat until the remaining accumulated size is below mapred.max.split.size, then go to the next step;
b. If the remaining blocks' accumulated size is greater than or equal to mapred.min.split.size.per.rack, turn them into one split; if it is smaller, leave these blocks for later processing.
(3) Iterate over and accumulate the blocks that are still left; whenever the accumulated size is greater than or equal to mapred.max.split.size, turn those blocks into one split; repeat until the remaining accumulated size is below mapred.max.split.size, then go to the next step;
(4) The remaining blocks form one final split.
1.3.3. Core implementation
// mapping from a rack name to the list of blocks it has
HashMap<String,List<OneBlockInfo>> rackToBlocks =
new HashMap<String,List<OneBlockInfo>>();
// mapping from a block to the nodes on which it has replicas
HashMap<OneBlockInfo,String[]> blockToNodes =
new HashMap<OneBlockInfo,String[]>();
// mapping from a node to the list of blocks that it contains
HashMap<String,List<OneBlockInfo>> nodeToBlocks =
new HashMap<String,List<OneBlockInfo>>();
Before split formation starts, three important mappings have to be initialized:
- rackToBlocks: the rack-to-blocks mapping, i.e. which blocks sit on a given rack;
- blockToNodes: the block-to-nodes mapping, i.e. which nodes hold replicas ("copies") of a given block;
- nodeToBlocks: the node-to-blocks mapping, i.e. which blocks sit on a given node.
The initialization is shown in the code below: each Path (file) is turned into a OneFileInfo object, and the mappings are maintained while each OneFileInfo is being built.
// populate all the blocks for all files
long totLength = 0;
for (int i = 0; i < paths.length; i++) {
    files[i] = new OneFileInfo(paths[i], job,
                               rackToBlocks, blockToNodes, nodeToBlocks, rackToNodes);
    totLength += files[i].getLength();
}
(1) Form splits node by node, as in the following code:
// blocks contained in the split currently being built
ArrayList<OneBlockInfo> validBlocks = new ArrayList<OneBlockInfo>();
// nodes that the blocks of the current split belong to
ArrayList<String> nodes = new ArrayList<String>();
// size of the current split
long curSplitSize = 0;

// process all nodes and create splits that are local to a node;
// the blocks of each node are handled in turn
for (Iterator<Map.Entry<String, List<OneBlockInfo>>> iter = nodeToBlocks.entrySet().iterator(); iter.hasNext();) {
    Map.Entry<String, List<OneBlockInfo>> one = iter.next();
    nodes.add(one.getKey());
    List<OneBlockInfo> blocksInNode = one.getValue();

    // for each block, copy it into validBlocks. Delete it from blockToNodes
    // so that the same block does not appear in two different splits;
    // this is exactly what guarantees that one block never ends up in two splits.
    for (OneBlockInfo oneblock : blocksInNode) {
        if (blockToNodes.containsKey(oneblock)) {
            validBlocks.add(oneblock);
            blockToNodes.remove(oneblock);
            curSplitSize += oneblock.length;

            // if the accumulated split size reaches the maximum (maxSize), then create this split
            if (maxSize != 0 && curSplitSize >= maxSize) {
                // create an input split and add it to the splits array
                addCreatedSplit(job, splits, nodes, validBlocks);
                curSplitSize = 0;
                validBlocks.clear();
            }
        }
    }

    // if there were any blocks left over and their combined size is
    // larger than minSplitNode, then combine them into one split.
    // Otherwise add them back to the unprocessed pool; it is likely
    // that they will be combined with other blocks from the same rack later on.
    // In other words: leftover blocks of at least minSizeNode form one split,
    // smaller leftovers are returned to blockToNodes for the later "same rack" pass.
    if (minSizeNode != 0 && curSplitSize >= minSizeNode) {
        // create an input split and add it to the splits array
        addCreatedSplit(job, splits, nodes, validBlocks);
    } else {
        for (OneBlockInfo oneblock : validBlocks) {
            blockToNodes.put(oneblock, oneblock.hosts);
        }
    }
    validBlocks.clear();
    nodes.clear();
    curSplitSize = 0;
}
(2) Form splits rack by rack, as in the following code:
// if blocks in a rack are below the specified minimum size, then keep them
// in 'overflow'. After the processing of all racks is complete, these overflow
// blocks will be combined into splits.
// overflowBlocks holds the blocks that are left over after this "same rack" pass
ArrayList<OneBlockInfo> overflowBlocks = new ArrayList<OneBlockInfo>();
ArrayList<String> racks = new ArrayList<String>();

// Process all racks over and over again until there is no more work to do.
while (blockToNodes.size() > 0) {
    // Create one split for this rack before moving over to the next rack.
    // Come back to this rack after creating a single split for each of the
    // remaining racks.
    // Process one rack location at a time, combine all possible blocks that
    // reside on this rack as one split (constrained by minimum and maximum
    // split size).

    // iterate over all racks, handling each rack in turn
    for (Iterator<Map.Entry<String, List<OneBlockInfo>>> iter =
             rackToBlocks.entrySet().iterator(); iter.hasNext();) {
        Map.Entry<String, List<OneBlockInfo>> one = iter.next();
        racks.add(one.getKey());
        List<OneBlockInfo> blocks = one.getValue();

        // for each block, copy it into validBlocks. Delete it from
        // blockToNodes so that the same block does not appear in
        // two different splits.
        boolean createdSplit = false;
        // handle each block of this rack in turn
        for (OneBlockInfo oneblock : blocks) {
            if (blockToNodes.containsKey(oneblock)) {
                validBlocks.add(oneblock);
                blockToNodes.remove(oneblock);
                curSplitSize += oneblock.length;

                // if the accumulated split size reaches the maximum (maxSize), then create this split
                if (maxSize != 0 && curSplitSize >= maxSize) {
                    // create an input split and add it to the splits array
                    addCreatedSplit(job, splits, getHosts(racks), validBlocks);
                    createdSplit = true;
                    break;
                }
            }
        }

        // if we created a split, then just go to the next rack
        if (createdSplit) {
            curSplitSize = 0;
            validBlocks.clear();
            racks.clear();
            continue;
        }

        if (!validBlocks.isEmpty()) {
            // if the leftover blocks reach minSizeRack, they form one split
            if (minSizeRack != 0 && curSplitSize >= minSizeRack) {
                // if there is a minimum size specified, then create a single split,
                // otherwise store these blocks into the overflow data structure
                addCreatedSplit(job, splits, getHosts(racks), validBlocks);
            } else {
                // There were a few blocks in this rack that remained to be processed.
                // Keep them in the 'overflow' block list. These will be combined later.
                // (leftover blocks smaller than minSizeRack go into overflowBlocks)
                overflowBlocks.addAll(validBlocks);
            }
        }
        curSplitSize = 0;
        validBlocks.clear();
        racks.clear();
    }
}
(3) Iterate over and accumulate the remaining (overflow) blocks, as in the following code:
// Process all overflow blocks
for (OneBlockInfo oneblock : overflowBlocks) {
    validBlocks.add(oneblock);
    curSplitSize += oneblock.length;

    // This might cause an existing rack location to be re-added,
    // but it should be ok.
    for (int i = 0; i < oneblock.racks.length; i++) {
        racks.add(oneblock.racks[i]);
    }

    // if the accumulated split size reaches the maximum (maxSize), then
    // create this split.
    if (maxSize != 0 && curSplitSize >= maxSize) {
        // create an input split and add it to the splits array
        addCreatedSplit(job, splits, getHosts(racks), validBlocks);
        curSplitSize = 0;
        validBlocks.clear();
        racks.clear();
    }
}
(4) The remaining blocks form one last split, as in the following code:
// Process any remaining blocks, if any.
if (!validBlocks.isEmpty()) {
    addCreatedSplit(job, splits, getHosts(racks), validBlocks);
}
1.3.4. Summary
While forming splits, CombineFileInputFormat takes data locality into account (same node, then same rack): it first combines blocks on the same node, then blocks on the same rack, and finally the leftover blocks, so the locality it can guarantee weakens step by step. Also note that CombineFileInputFormat is abstract; to use it directly you have to implement its record-reader factory method (getRecordReader in the old API, createRecordReader in the new one) yourself.
SequenceFileInputFormat / SequenceFileOutputFormat:
SequenceFile is a very important data format in Hadoop;
inside a SequenceFile the data is organized as key-value pairs.
These two classes read input from / write output to Hadoop sequence files.
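A minimal driver sketch of a job whose output is written as a SequenceFile (the key/value classes and the output path are illustrative assumptions); a downstream job can then read it back simply by setting SequenceFileInputFormat as its input format:
// in the job driver: write the (key, value) output as a SequenceFile
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(Text.class);            // illustrative key type
job.setOutputValueClass(LongWritable.class);  // illustrative value type
FileOutputFormat.setOutputPath(job, new Path("/out/seq"));  // illustrative path
// downstream job:
// job2.setInputFormatClass(SequenceFileInputFormat.class);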
2. MultipleInputs
Although FileInputFormat can read multiple directories, sometimes the data to be processed comes from different sources, or version upgrades have introduced format differences: for example, some files are tab-separated while others are comma-separated. In such cases MultipleInputs can be used to assign a different mapper class to each input path.
Example scenario: a data-analysis system has to analyze two kinds of files, one in plain text format and one in SequenceFile format.
2.1. Implementation steps:
2.1.1. Write two mapper classes with different logic, one per file format
static class TextMapperA extends Mapper<LongWritable, Text, Text, LongWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] words = line.split(" ");
for (String w : words) {
context.write(new Text(w), new LongWritable(1));
}
}
}
static class SequenceMapperB extends Mapper<Text, LongWritable, Text, LongWritable> {
@Override
protected void map(Text key, LongWritable value, Context context) throws IOException, InterruptedException {
context.write(key, value);
}
}
2.1.2. In the job driver, use MultipleInputs to assign a different mapper class and InputFormat to each kind of input
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(WordCount.class);
// assign a different mapper class and InputFormat class to each input path
MultipleInputs.addInputPath(job, new Path("c:/wordcount/textdata"),TextInputFormat.class, TextMapperA.class);
MultipleInputs.addInputPath(job, new Path("c:/wordcount/seqdata"),SequenceFileInputFormat.class, SequenceMapperB.class);
job.setReducerClass(SameReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
FileOutputFormat.setOutputPath(job, new Path("c:/wordcount/multiinouts"));
job.waitForCompletion(true);
}
3. Custom InputFormat
When the built-in TextInputFormat and SequenceFileInputFormat do not meet the requirements, a custom InputFormat can be written to read the files.
Example scenario: merge small files, each as a whole, into a large SequenceFile (common in production: MR is good at handling large files, but many production systems generate huge numbers of small files, and analyzing them directly with Hadoop is inefficient; a SequenceFile is a convenient way to merge the small files into a large one and thus improve processing efficiency).
The usual approach: use each small file's name as the key and its entire content as the value, and write these pairs into one big SequenceFile.
3.1. Code:
3.1.1. Define a custom InputFormat
static class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
@Override
//override the parent's logic and always return false, so that every file is treated as a single, unsplittable split
protected boolean isSplitable(JobContext context, Path filename) {
return false;
}
@Override
public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
//return a custom RecordReader that does the actual reading
WholeFileRecordReader reader = new WholeFileRecordReader();
reader.initialize(split, context);
return reader;
}
}
3.1.2. Implement the custom WholeFileRecordReader
class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
private FileSplit fileSplit;
private Configuration conf;
//a bytes buffer that caches the entire content of one small file
private BytesWritable value = new BytesWritable();
private boolean processed = false;
//initialization: store the incoming file split and context objects in the class members
@Override
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
fileSplit = (FileSplit) split;
conf = context.getConfiguration();
}
@Override
//core logic: read the source data and package it as a KEY / VALUE pair
public boolean nextKeyValue() throws IOException, InterruptedException {
//processed is true once the current small file has been handled
if (!processed) {
byte[] contents = new byte[(int) fileSplit.getLength()];
Path filePath = fileSplit.getPath();
FileSystem fs = filePath.getFileSystem(conf);
FSDataInputStream in = fs.open(filePath);
IOUtils.readFully(in, contents, 0, contents.length);
value.set(contents, 0, contents.length);
IOUtils.closeStream(in);
processed = true;
return true;
}
//if the current small file has already been processed, return false so the caller moves on to the next file split
return false;
}
@Override
//return the current key
public NullWritable getCurrentKey() throws IOException, InterruptedException {
return NullWritable.get();
}
@Override
//return the current value
public BytesWritable getCurrentValue() throws IOException, InterruptedException {
return value;
}
@Override
//report progress: 1 once the small file has been read, 0 otherwise
public float getProgress() throws IOException, InterruptedException {
return processed ? 1.0f : 0.0f;
}
@Override
public void close() throws IOException {
}
}
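To complete the scenario, here is a minimal sketch of the mapper and driver that use WholeFileInputFormat to pack the small files into one SequenceFile; the enclosing class name SmallFilesToSequenceFile, the use of the file path as key, and the input/output paths are assumptions for illustration:
static class SequenceFileMapper extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
    private Text filenameKey;
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // use the path of the file backing this split as the output key
        InputSplit split = context.getInputSplit();
        Path path = ((FileSplit) split).getPath();
        filenameKey = new Text(path.toString());
    }
    @Override
    protected void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // one record per small file: (file name, whole file content)
        context.write(filenameKey, value);
    }
}
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    job.setJarByClass(SmallFilesToSequenceFile.class);   // hypothetical enclosing class
    job.setInputFormatClass(WholeFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapperClass(SequenceFileMapper.class);
    job.setNumReduceTasks(0);                             // map-only merge job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);
    FileInputFormat.addInputPath(job, new Path("c:/smallfiles"));            // illustrative path
    FileOutputFormat.setOutputPath(job, new Path("c:/smallfiles-merged"));   // illustrative path
    job.waitForCompletion(true);
}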
4. MapReduce output format components
4.1. TextOutputFormat source-code structure:
4.2. MultipleOutputs
static class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
MultipleOutputs<Text, LongWritable> multipleOutputs = null;
//construct a MultipleOutputs instance in the setup method
protected void setup(Context context) throws IOException, InterruptedException {
multipleOutputs = new MultipleOutputs<Text, LongWritable>(context);
}
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
long count = 0;
for (LongWritable value : values) {
count += value.get();
}
if (key.toString().startsWith("a")) {
//different content can be written to different files based on a condition
multipleOutputs.write(key, new LongWritable(count), "c:/multi/outputa/a");
} else {
multipleOutputs.write(key, new LongWritable(count), "c:/sb/outputb/b");
}
}
//multipleOutputs must be closed, otherwise the content is never actually written to the files
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
multipleOutputs.close();
}
}
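In the driver, a default output path still has to be set with FileOutputFormat.setOutputPath; a common companion setting, shown here as a sketch with an illustrative path, is LazyOutputFormat, which suppresses the empty default part-files when everything is written through MultipleOutputs:
// in the job driver
job.setReducerClass(CountReducer.class);
// create the default output files lazily, i.e. only if something is actually written to them
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("c:/multi/defaultout"));  // illustrative path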
4.3. Custom FileOutputFormat
public class FlowOutputFormat extends FileOutputFormat<Text, NullWritable> {
@Override
public RecordWriter<Text, NullWritable> getRecordWriter(
TaskAttemptContext context) throws IOException,
InterruptedException {
FileSystem fs = FileSystem.get(context.getConfiguration());
Path enhancelog = new Path("hdfs://weekend01:9000/enhance/enhanced.log");
Path tocrawl = new Path("hdfs://weekend01:9000/enhance/tocrawl.log");
//construct two different output streams
FSDataOutputStream enhanceOs = fs.create(enhancelog);
FSDataOutputStream tocrawlOs = fs.create(tocrawl);
//pass the two streams to FlowRecordWriter through its constructor
return new FlowRecordWriter(enhanceOs, tocrawlOs);
}
public static class FlowRecordWriter extends
RecordWriter<Text, NullWritable> {
private FSDataOutputStream enhanceOs;
private FSDataOutputStream tocrawlOs;
public FlowRecordWriter(FSDataOutputStream enhanceOs,
FSDataOutputStream tocrawlOs) {
this.enhanceOs = enhanceOs;
this.tocrawlOs = tocrawlOs;
}
//the actual writing is done in the write method
@Override
public void write(Text key, NullWritable value) throws IOException,
InterruptedException {
String line = key.toString();
if (line.contains("itisok")) {
enhanceOs.write(line.getBytes());
} else {
tocrawlOs.write(line.getBytes());
}
}
@Override
public void close(TaskAttemptContext context) throws IOException,
InterruptedException {
if (enhanceOs != null) {
enhanceOs.close();
}
if (tocrawlOs != null) {
tocrawlOs.close();
}
}
}
}
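A minimal sketch of how this custom output format would be plugged into the driver; note that FileOutputFormat still requires an output directory for its committer even though the record writer above writes to its own fixed paths (the directory below is an illustrative assumption):
// in the job driver
job.setOutputFormatClass(FlowOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("hdfs://weekend01:9000/enhance/out"));  // illustrative committer directory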
5. The Configuration object and ToolRunner
5.1. Priority of configuration parameters:
*-site.xml on the cluster < configuration files under src (on the client classpath) < conf.set()
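A small sketch illustrating this priority order (the extra resource name is an illustrative assumption; a value set programmatically wins over anything loaded from resource files):
Configuration conf = new Configuration();                // loads *-default.xml and *-site.xml from the classpath
conf.addResource("my-extra-site.xml");                   // illustrative extra resource on the classpath
conf.set("mapreduce.job.reduces", "3");                  // programmatic value, highest priority
System.out.println(conf.get("mapreduce.job.reduces"));   // prints 3, regardless of what the xml files say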
5.2. ToolRunner: configuration parameters or files can be set dynamically on the submit command line
The Configuration object can also be used to distribute small amounts of data to all task nodes.
Example:
/**
Parameters can be passed to the conf object by adding options at run time:
-D property=value
-conf filename ...
-fs uri            equivalent to -D fs.defaultFS=uri
-jt host:port      equivalent to -D yarn.resourcemanager.address=host:port
-files file1,file2,
-archives archive1,archive2
-libjars jar1,jar2,...
*
* @author duanhaitao@itcast.cn
*
*/
public class TestToolrunner extends Configured implements Tool {
static {
Configuration.addDefaultResource("hdfs-default.xml");
Configuration.addDefaultResource("hdfs-site.xml");
Configuration.addDefaultResource("core-default.xml");
Configuration.addDefaultResource("core-site.xml");
Configuration.addDefaultResource("mapred-default.xml");
Configuration.addDefaultResource("mapred-site.xml");
Configuration.addDefaultResource("yarn-default.xml");
Configuration.addDefaultResource("yarn-site.xml");
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
TreeMap<String, String> treeMap = new TreeMap<String,String>();
for (Entry<String, String> ent : conf) {
treeMap.put(ent.getKey(), ent.getValue());
}
for (Entry<String, String> ent : treeMap.entrySet()) {
System.out.printf("%s=%s\n", ent.getKey(), ent.getValue());
}
return 0;
}
public static void main(String[] args) throws Exception {
ToolRunner.run(new TestToolrunner(), args);
}
}
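A hedged usage example (the jar and file names are illustrative): when the program is submitted through ToolRunner, GenericOptionsParser consumes the generic options before the application's own arguments, e.g.
hadoop jar mymr.jar TestToolrunner -D mapreduce.job.reduces=3 -conf extra-site.xml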
6. MapReduce data compression
For computation-intensive jobs, use compression sparingly.
For IO-intensive jobs, use compression more.
Compressing the mapper or reducer output with a compression codec reduces disk IO and speeds up the MR program (at the cost of extra CPU load).
6.1. Compression codecs supported by MR
6.2. Compressing the reducer output
---- configuration properties
mapreduce.output.fileoutputformat.compress=false
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
mapreduce.output.fileoutputformat.compress.type=RECORD
Or set it in code:
Job job = Job.getInstance(conf);
FileOutputFormat.setCompressOutput(job, true);
// the codec class name is left out here; e.g. org.apache.hadoop.io.compress.GzipCodec could be used
FileOutputFormat.setOutputCompressorClass(job, (Class<? extends CompressionCodec>) Class.forName(""));
6.3. Compressing the mapper output
---- configuration properties
mapreduce.map.output.compress=false
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
Or in code:
conf.setBoolean(Job.MAP_OUTPUT_COMPRESS, true);
conf.setClass(Job.MAP_OUTPUT_COMPRESS_CODEC, GzipCodec.class, CompressionCodec.class);
6.4. Reading compressed files
Hadoop's built-in InputFormat classes support reading compressed files out of the box. Take TextInputFormat as an example: in the initialize method of its record reader (LineRecordReader):
public void initialize(InputSplit genericSplit,
TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
// open the file and seek to the start of the split
final FileSystem fs = file.getFileSystem(job);
fileIn = fs.open(file);
//create the appropriate compression codec based on the file name suffix
CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
if (null!=codec) {
isCompressedInput = true;
decompressor = CodecPool.getDecompressor(codec);
//check whether this codec is a splittable compression codec
if (codec instanceof SplittableCompressionCodec) {
final SplitCompressionInputStream cIn =
((SplittableCompressionCodec)codec).createInputStream(
fileIn, decompressor, start, end,
SplittableCompressionCodec.READ_MODE.BYBLOCK);
//for a splittable codec, create a CompressedSplitLineReader to read the compressed data
in = new CompressedSplitLineReader(cIn, job,
this.recordDelimiterBytes);
start = cIn.getAdjustedStart();
end = cIn.getAdjustedEnd();
filePosition = cIn;
} else {
//for a non-splittable codec, wrap the file input stream in a decompressing stream and hand it to an ordinary SplitLineReader
in = new SplitLineReader(codec.createInputStream(fileIn,
decompressor), job, this.recordDelimiterBytes);
filePosition = fileIn;
}
} else {
fileIn.seek(start);
//for an uncompressed file, create an ordinary SplitLineReader to read the data
in = new SplitLineReader(fileIn, job, this.recordDelimiterBytes);
filePosition = fileIn;
}
7. MapReduce counters
7.1. Built-in counters of the MapReduce framework:
- Task group
- Inputformat group
- Outputformat group
- Framework group
7.2. User-defined counters
- defined via an enum
- named dynamically (by group and counter name)
public class MultiOutputs {
//define a custom counter via an enum
enum MyCounter{MALFORORMED,NORMAL}
static class CommaMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] words = value.toString().split(",");
for (String word : words) {
context.write(new Text(word), new LongWritable(1));
}
//increment the enum-defined custom counter by 1
context.getCounter(MyCounter.MALFORORMED).increment(1);
//increment a dynamically named custom counter by 1
context.getCounter("counterGroupa", "countera").increment(1);
}
}
}
How counters work, in brief: they are maintained by the ApplicationMaster and are global (aggregated across all tasks).
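Because the counters are aggregated globally, their final values can be read back in the driver once the job has finished; a minimal sketch (assuming the MultiOutputs job above):
job.waitForCompletion(true);
// counters are aggregated by the ApplicationMaster across all tasks
Counters counters = job.getCounters();
long malformed = counters.findCounter(MultiOutputs.MyCounter.MALFORORMED).getValue();
long dynamic = counters.findCounter("counterGroupa", "countera").getValue();
System.out.println("malformed: " + malformed + ", countera: " + dynamic);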
8. MapReduce log analysis
Where the logs are stored:
the logs of the system service daemons are stored by default in the logs directory under the Hadoop installation directory.
9. Chaining multiple jobs
Use the JobControl class to establish ordering and dependency relations between jobs;
example:
ControlledJob cJob1 = new ControlledJob(job1.getConfiguration());
ControlledJob cJob2 = new ControlledJob(job2.getConfiguration());
ControlledJob cJob3 = new ControlledJob(job3.getConfiguration());
// set up the dependencies between the jobs
cJob2.addDependingJob(cJob1);
cJob3.addDependingJob(cJob2);
JobControl jobControl = new JobControl("RecommendationJob");
jobControl.addJob(cJob1);
jobControl.addJob(cJob2);
jobControl.addJob(cJob3);
cJob1.setJob(job1);
cJob2.setJob(job2);
cJob3.setJob(job3);
// start a new thread to run the jobs that were added to JobControl, then wait until they all finish
Thread jobControlThread = new Thread(jobControl);
jobControlThread.start();
while (!jobControl.allFinished()) {
Thread.sleep(500);
}
jobControl.stop();
return 0;