I need to process SDF files with Hadoop Streaming. An SDF file looks like this:
1
-OEChem-12181003042D
.....
$$$$
— that is, a sequence of multi-line records, each terminated by a line of $$$$.
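For reference, one complete record is schematically laid out like this (the data field name below is invented, not taken from a real file):

molecule name
-OEChem-12181003042D
(comment line)
(counts line, atom block, bond block)
M  END
> <SOME_DATA_FIELD>
value

$$$$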
Hadoop's default splitting, however, is purely block-based (FileInputFormat.java, getSplits):
for (FileStatus file: files) {
  Path path = file.getPath();
  FileSystem fs = path.getFileSystem(job.getConfiguration());
  long length = file.getLen();
  BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
  if ((length != 0) && isSplitable(job, path)) {
    long blockSize = file.getBlockSize();
    long splitSize = computeSplitSize(blockSize, minSize, maxSize);

    // Carve the file into splitSize-byte pieces purely by offset --
    // record boundaries are never consulted here.
    long bytesRemaining = length;
    while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
      int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
      splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                               blkLocations[blkIndex].getHosts()));
      bytesRemaining -= splitSize;
    }

    if (bytesRemaining != 0) {
      splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                               blkLocations[blkLocations.length-1].getHosts()));
    }
  } else if (length != 0) {
    // Unsplittable file: one split covering the whole file.
    splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
  } else {
    // Create empty hosts array for zero length files
    splits.add(new FileSplit(path, 0, length, new String[0]));
  }
}
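As the branches above show, the byte-based carving only happens when isSplitable(job, path) returns true; the else-if branch turns an unsplittable file into a single split. That suggests the crudest fix: a subclass that refuses to split, so each mapper sees whole files and a streaming script can safely accumulate lines until it hits $$$$. A minimal sketch (the class name is my own):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Crude workaround: declare SDF files unsplittable, so the "else if"
// branch of getSplits() above always yields one split per file and no
// $$$$-terminated record can be cut in half.
public class WholeFileSdfInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}

That throws away parallelism (and data locality) within large files, though, which is why fixing the record boundaries themselves is the better route.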
My first thought was to modify the getSplits method in FileInputFormat.java directly, but having only just started with Java and Hadoop, I needed other examples to learn from. I found the source of NLineInputFormat, whose getSplits produces splits of a given number of lines, which gave me a first lead. Then it occurred to me that TextInputFormat reads input line by line, so it must do some extra work when reading a split — and indeed it does: on top of the byte-based splits, it skips the partial line at the start of a split and reads one extra line past the end of the split to compensate.
LineRecordReader.java:
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
boolean skipFirstLine = false;
if (codec != null) {
  // Compressed input cannot be split: read the whole stream.
  in = new LineReader(codec.createInputStream(fileIn), job);
  end = Long.MAX_VALUE;
} else {
  if (start != 0) {
    // Not the first split: back up one byte and discard the first
    // (partial) line -- the previous split reads it by running past
    // its own end.
    skipFirstLine = true;
    --start;
    fileIn.seek(start);
  }
  in = new LineReader(fileIn, job);
}
if (skipFirstLine) {  // skip first line and re-establish "start".
  start += in.readLine(new Text(), 0,
                       (int)Math.min((long)Integer.MAX_VALUE, end - start));
}
this.pos = start;
So all I had to do was rewrite TextInputFormat (together with its record reader) along the same lines.
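To make that concrete, here is a minimal sketch of such a rewrite. All names are mine, and this is not the code behind the link below; it assumes the terminator is a line containing exactly $$$$, ignores compressed input, and glosses over pathological cases such as an ordinary line that happens to end in $$$$ straddling a split boundary:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Treats one $$$$-terminated SDF molecule as one record: key = byte
// offset of the record, value = the whole record text.
public class SdfInputFormat extends FileInputFormat<LongWritable, Text> {
  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new SdfRecordReader();
  }
}

class SdfRecordReader extends RecordReader<LongWritable, Text> {
  private static final String DELIMITER = "$$$$";

  private long start, pos, end;
  private LineReader in;
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();
  private final Text line = new Text();

  @Override
  public void initialize(InputSplit genericSplit, TaskAttemptContext context)
      throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration conf = context.getConfiguration();
    start = split.getStart();
    end = start + split.getLength();
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream fileIn = fs.open(file);
    fileIn.seek(start);
    in = new LineReader(fileIn, conf);
    pos = start;
    // Same trick as LineRecordReader, lifted from lines to records: a
    // split that does not begin at byte 0 starts somewhere inside a
    // record, so discard lines up to and including the next $$$$ line;
    // the previous split completes that record by reading past its end.
    if (start != 0) {
      int size;
      while ((size = in.readLine(line)) > 0) {
        pos += size;
        if (DELIMITER.equals(line.toString())) {
          break;
        }
      }
    }
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    // Only begin a record whose first byte lies at or before the split
    // end; once begun, read on past the end until $$$$ is seen -- that
    // is exactly the record the next split's initialize() skipped.
    if (pos > end) {
      return false;
    }
    key.set(pos);
    StringBuilder record = new StringBuilder();
    int size;
    while ((size = in.readLine(line)) > 0) {
      pos += size;
      record.append(line.toString()).append('\n');
      if (DELIMITER.equals(line.toString())) {
        break;
      }
    }
    if (record.length() == 0) {
      return false;  // end of file
    }
    value.set(record.toString());
    return true;
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }

  @Override
  public float getProgress() {
    return (end == start) ? 0.0f
        : Math.min(1.0f, (pos - start) / (float) (end - start));
  }

  @Override
  public void close() throws IOException {
    if (in != null) in.close();
  }
}

In a regular Java job this would be wired in with job.setInputFormatClass(SdfInputFormat.class). One caveat for streaming: as far as I can tell, Hadoop Streaming in 1.x takes its -inputformat class from the old org.apache.hadoop.mapred API, so for streaming the same idea has to be expressed against that API; I used the new org.apache.hadoop.mapreduce classes above only to match the snippets quoted in this post.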
Here is the link to the source code: source code
Environment: Eclipse 3.7 + hadoop-1.0.0 + 64-bit Gentoo.