The Hadoop version problems are serious: in 0.21, streaming mode cannot use CombineFileInputFormat correctly. The fix requires modifying part of the source and implementing a CombineFileLineRecordReader.
Source file to modify: org.apache.hadoop.mapred.lib.CombineFileInputFormat.java inside the hadoop-mapred-0.21.0.jar package.
Streaming mode requires task splits of type org.apache.hadoop.mapred.InputSplit (the old API), while the splits actually produced are of type org.apache.hadoop.mapreduce.lib.input.CombineFileSplit (the new API), so the split type must be converted:
public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
  // Delegate to the new-API implementation, then convert its splits
  // to the old-API type that streaming expects.
  List<org.apache.hadoop.mapreduce.InputSplit> splits = super.getSplits(new Job(job));
  int size = splits.size();
  if (splits.get(0) instanceof org.apache.hadoop.mapreduce.lib.input.CombineFileSplit) {
    InputSplit[] returnSplits = new InputSplit[size];
    for (int i = 0; i < size; i++) {
      org.apache.hadoop.mapreduce.lib.input.CombineFileSplit combineFileSplit =
          (org.apache.hadoop.mapreduce.lib.input.CombineFileSplit) splits.get(i);
      Path[] paths = combineFileSplit.getPaths();
      long[] starts = combineFileSplit.getStartOffsets();
      long[] lengths = combineFileSplit.getLengths();
      String[] locations = combineFileSplit.getLocations();
      // Rebuild each split as an old-API org.apache.hadoop.mapred.lib.CombineFileSplit
      returnSplits[i] = new CombineFileSplit(job, paths, starts, lengths, locations);
    }
    return returnSplits;
  }
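The CombineFileLineRecordReader mentioned at the top still has to be written. The following is a sketch of one possible implementation, not the author's exact code: it assumes the old (mapred) API that streaming expects, and simply walks the chunks of the combined split, reading each one with a LineRecordReader and falling through to the next file when one is exhausted.

```java
package org.apache.hadoop.mapred.lib;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: reads every chunk of an old-API CombineFileSplit as lines.
public class CombineFileLineRecordReader
    implements RecordReader<LongWritable, Text> {

  private final CombineFileSplit split;
  private final JobConf job;
  private int idx = 0;               // index of the chunk currently being read
  private LineRecordReader current;  // line reader over that chunk

  public CombineFileLineRecordReader(CombineFileSplit split, JobConf job,
                                     Reporter reporter) throws IOException {
    this.split = split;
    this.job = job;
    nextReader();
  }

  // Open a LineRecordReader over the next chunk in the combined split.
  private boolean nextReader() throws IOException {
    if (current != null) {
      current.close();
    }
    if (idx >= split.getNumPaths()) {
      return false;
    }
    Path path = split.getPath(idx);
    FileSplit fileSplit = new FileSplit(path, split.getOffset(idx),
        split.getLength(idx), split.getLocations());
    current = new LineRecordReader(job, fileSplit);
    idx++;
    return true;
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    // When the current chunk is exhausted, fall through to the next one.
    while (!current.next(key, value)) {
      if (!nextReader()) {
        return false;
      }
    }
    return true;
  }

  public LongWritable createKey() { return current.createKey(); }
  public Text createValue() { return current.createValue(); }
  public long getPos() throws IOException { return current.getPos(); }
  public void close() throws IOException {
    if (current != null) {
      current.close();
    }
  }
  public float getProgress() {
    return Math.min(1.0f, idx / (float) split.getNumPaths());
  }
}
```

With something like this in place, the modified input format's getRecordReader would return a new CombineFileLineRecordReader built from the (old-API) CombineFileSplit, the JobConf, and the Reporter.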