Hadoop: Implementing Custom Splits with a Custom InputFormat

The previous article covered how to override the RecordReader; this one shows how to implement a custom split size.

The requirement to solve:

(1) A text file in which every line records the path of one file.

(2) The files behind those paths need to be processed, and since there are a lot of them, the processing should be distributed.

(3) So the input document gets preprocessed, taking the first N lines as one split. That part never got implemented, because overriding FileSplit is not that easy; the lazy way out was to define a split as a fixed 1000 bytes, which is enough to slice the input document into pieces. (A built-in alternative is sketched right after this list.)
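
As an aside: Hadoop actually ships an NLineInputFormat that implements exactly the N-lines-per-split behavior wanted in (3). Whether it is usable depends on your release; the new-API class lives in org.apache.hadoop.mapreduce.lib.input and only appears in newer versions. A minimal sketch, assuming it is available (the class name NLineDemo and the line count of 100 are made up for illustration):

    // Sketch only -- assumes a Hadoop release that ships the new-API NLineInputFormat.
    package an.hadoop.test;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class NLineDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "nline_demo");
            job.setInputFormatClass(NLineInputFormat.class);
            // every split carries exactly 100 lines of the path list,
            // so every mapper handles 100 file paths
            NLineInputFormat.setNumLinesPerSplit(job, 100);
            // ... mapper, output types and paths would be set as usual ...
        }
    }

The trade-off is that NLineInputFormat scans each input file line by line when computing splits, which costs an extra pass over the path list but guarantees an equal number of lines per split.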

Straight to the code:

InputFormat

    package an.hadoop.test;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class LineInputFormat extends FileInputFormat<LongWritable, Text> {

        private static final Log LOG = LogFactory.getLog(LineInputFormat.class);
        private static final double SPLIT_SLOP = 1.1; // the last split may be up to 10% larger

        public long mySplitSize = 1000; // fixed split size, in bytes

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                                   TaskAttemptContext context) {
            return new LineRecordReader(); // the stock line reader handles the custom splits
        }

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // The stock version refuses to split compressed files:
            //   CompressionCodec codec =
            //       new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
            //   return codec == null;
            return true; // force splitting (only safe for uncompressed input)
        }

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
            long maxSize = getMaxSplitSize(job);

            // Generate the splits.
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (FileStatus file : listStatus(job)) { // one FileStatus per input file
                Path path = file.getPath();
                FileSystem fs = path.getFileSystem(job.getConfiguration());
                long length = file.getLen(); // length of the file in bytes
                // locations of the blocks holding this file
                BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
                if ((length != 0) && isSplitable(job, path)) {
                    long blockSize = file.getBlockSize();
                    // The stock computation is:
                    //   long splitSize = computeSplitSize(blockSize, minSize, maxSize);
                    // i.e. Math.max(minSize, Math.min(maxSize, blockSize)), which is why a
                    // file larger than the block size (64 MB or 32 MB) is normally split at
                    // block boundaries. Substituting a fixed value controls the split size
                    // directly. The splits still fall on byte boundaries rather than line
                    // boundaries, so the lines per split are not guaranteed to be equal.
                    long splitSize = mySplitSize;
                    long bytesRemaining = length;
                    // keep emitting splits while more than SPLIT_SLOP * splitSize bytes remain
                    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                        // index of the block containing the current offset
                        int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                        // FileSplit(Path file, long start, long length, String[] hosts);
                        // overriding FileSplit itself should also be possible
                        splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                                                 blkLocations[blkIndex].getHosts()));
                        bytesRemaining -= splitSize;
                    }
                    if (bytesRemaining != 0) {
                        splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
                                                 blkLocations[blkLocations.length - 1].getHosts()));
                    }
                } else if (length != 0) {
                    // not splittable: one split covering the whole file
                    splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
                } else {
                    // create an empty hosts array for zero-length files
                    splits.add(new FileSplit(path, 0, length, new String[0]));
                }
            }
            LOG.debug("Total # of splits: " + splits.size());
            return splits;
        }
    }
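
Worth noting: hardcoding mySplitSize is not strictly necessary. Since the stock computeSplitSize is Math.max(minSize, Math.min(maxSize, blockSize)), capping the maximum split size gets the same 1000-byte splits out of the unmodified TextInputFormat. A sketch, assuming the setMaxInputSplitSize helper exists in your release (otherwise the mapred.max.split.size property can be set directly):

    // Sketch: 1000-byte splits from the stock input format, no subclassing.
    package an.hadoop.test;

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitSizeConfig {
        public static void configure(Job job) {
            job.setInputFormatClass(TextInputFormat.class);
            // splitSize = max(minSize, min(maxSize, blockSize))
            //           = max(1, min(1000, 64 MB)) = 1000 bytes
            FileInputFormat.setMaxInputSplitSize(job, 1000L);
        }
    }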

The main class:


    package an.hadoop.test;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class Test_multi {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: test_multi <in> <out>");
                System.exit(2);
            }
            Job job = new Job(conf, "test_multi");
            job.setJarByClass(Test_multi.class);
            job.setMapperClass(MultiMapper.class);
            // job.setInputFormatClass(LineInputFormat.class); // enable the custom InputFormat
            // job.setCombinerClass(IntSumReducer.class);
            // job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
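
The MultiMapper referenced above is not shown in the post. A hypothetical stand-in consistent with the job configuration (Text output key and value, one output record per input line) might look like this; the real mapper would open and process the file each line points at:

    package an.hadoop.test;

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical stand-in: each input value is one line of the path-list
    // file, i.e. the path of a file to process.
    public class MultiMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String filePath = line.toString().trim();
            // ... open filePath and do the real per-file work here ...
            context.write(new Text(filePath), new Text("processed"));
        }
    }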

Now let's look at the logs.

Without the custom InputFormat, the job log is:


    11/11/10 14:54:25 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 
    11/11/10 14:54:25 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String). 
    11/11/10 14:54:25 INFO input.FileInputFormat: Total input paths to process : 1 
    11/11/10 14:54:25 INFO mapred.JobClient: Running job: job_local_0001 
    11/11/10 14:54:25 INFO input.FileInputFormat: Total input paths to process : 1 
    11/11/10 14:54:26 INFO mapred.MapTask: io.sort.mb = 100 
    11/11/10 14:54:26 INFO mapred.JobClient:  map 0% reduce 0% 
    11/11/10 14:54:26 INFO mapred.MapTask: data buffer = 79691776/99614720 
    11/11/10 14:54:26 INFO mapred.MapTask: record buffer = 262144/327680 
    11/11/10 14:54:32 INFO mapred.LocalJobRunner:  
    11/11/10 14:54:33 INFO mapred.JobClient:  map 58% reduce 0% 
    11/11/10 14:54:34 INFO mapred.MapTask: Starting flush of map output 
    11/11/10 14:54:35 INFO mapred.LocalJobRunner:  
    11/11/10 14:54:35 INFO mapred.JobClient:  map 100% reduce 0% 
    11/11/10 14:54:35 INFO mapred.MapTask: Finished spill 0 
    11/11/10 14:54:35 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting 
    11/11/10 14:54:35 INFO mapred.LocalJobRunner:  
    11/11/10 14:54:35 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done. 
    11/11/10 14:54:35 INFO mapred.LocalJobRunner:  
    11/11/10 14:54:35 INFO mapred.Merger: Merging 1 sorted segments 
    11/11/10 14:54:35 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 2974 bytes 
    11/11/10 14:54:35 INFO mapred.LocalJobRunner:  
    11/11/10 14:54:36 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting 
    11/11/10 14:54:36 INFO mapred.LocalJobRunner:  
    11/11/10 14:54:36 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now 
    11/11/10 14:54:36 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://an.local:9100/user/an/out2 
    11/11/10 14:54:36 INFO mapred.LocalJobRunner: reduce > reduce 
    11/11/10 14:54:36 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done. 
    11/11/10 14:54:36 INFO mapred.JobClient:  map 100% reduce 100% 
    11/11/10 14:54:36 INFO mapred.JobClient: Job complete: job_local_0001 
    11/11/10 14:54:36 INFO mapred.JobClient: Counters: 14 
    11/11/10 14:54:36 INFO mapred.JobClient:   FileSystemCounters 
    11/11/10 14:54:36 INFO mapred.JobClient:     FILE_BYTES_READ=35990 
    11/11/10 14:54:36 INFO mapred.JobClient:     HDFS_BYTES_READ=8052 
    11/11/10 14:54:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=72570 
    11/11/10 14:54:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2642 
    11/11/10 14:54:36 INFO mapred.JobClient:   Map-Reduce Framework 
    11/11/10 14:54:36 INFO mapred.JobClient:     Reduce input groups=165 
    11/11/10 14:54:36 INFO mapred.JobClient:     Combine output records=0 
    11/11/10 14:54:36 INFO mapred.JobClient:     Map input records=165 
    11/11/10 14:54:36 INFO mapred.JobClient:     Reduce shuffle bytes=0 
    11/11/10 14:54:36 INFO mapred.JobClient:     Reduce output records=165 
    11/11/10 14:54:36 INFO mapred.JobClient:     Spilled Records=330 
    11/11/10 14:54:36 INFO mapred.JobClient:     Map output bytes=2642 
    11/11/10 14:54:36 INFO mapred.JobClient:     Combine input records=0 
    11/11/10 14:54:36 INFO mapred.JobClient:     Map output records=165 
    11/11/10 14:54:36 INFO mapred.JobClient:     Reduce input records=165 

With the custom InputFormat, the log is:


    11/11/10 14:42:41 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 
    11/11/10 14:42:41 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String). 
    11/11/10 14:42:41 INFO input.FileInputFormat: Total input paths to process : 1 
    11/11/10 14:42:42 INFO mapred.JobClient: Running job: job_local_0001 
    11/11/10 14:42:42 INFO input.FileInputFormat: Total input paths to process : 1 
    11/11/10 14:42:42 INFO mapred.MapTask: io.sort.mb = 100 
    11/11/10 14:42:43 INFO mapred.JobClient:  map 0% reduce 0% 
    11/11/10 14:42:46 INFO mapred.MapTask: data buffer = 79691776/99614720 
    11/11/10 14:42:46 INFO mapred.MapTask: record buffer = 262144/327680 
    11/11/10 14:42:49 INFO mapred.MapTask: Starting flush of map output 
    11/11/10 14:42:49 INFO mapred.MapTask: Finished spill 0 
    11/11/10 14:42:49 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting 
    11/11/10 14:42:49 INFO mapred.LocalJobRunner:  
    11/11/10 14:42:49 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done. 
    11/11/10 14:42:49 INFO mapred.MapTask: io.sort.mb = 100 
    11/11/10 14:42:50 INFO mapred.MapTask: data buffer = 79691776/99614720 
    11/11/10 14:42:50 INFO mapred.MapTask: record buffer = 262144/327680 
    11/11/10 14:42:50 INFO mapred.JobClient:  map 100% reduce 0% 
    11/11/10 14:42:51 INFO mapred.MapTask: Starting flush of map output 
    11/11/10 14:42:51 INFO mapred.MapTask: Finished spill 0 
    11/11/10 14:42:51 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting 
    11/11/10 14:42:51 INFO mapred.LocalJobRunner:  
    11/11/10 14:42:51 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000001_0' done. 
    11/11/10 14:42:51 INFO mapred.MapTask: io.sort.mb = 100 
    11/11/10 14:42:51 INFO mapred.MapTask: data buffer = 79691776/99614720 
    11/11/10 14:42:51 INFO mapred.MapTask: record buffer = 262144/327680 
    11/11/10 14:42:53 INFO mapred.MapTask: Starting flush of map output 
    11/11/10 14:42:53 INFO mapred.MapTask: Finished spill 0 
    11/11/10 14:42:53 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000002_0 is done. And is in the process of commiting 
    11/11/10 14:42:53 INFO mapred.LocalJobRunner:  
    11/11/10 14:42:53 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000002_0' done. 
    11/11/10 14:42:53 INFO mapred.MapTask: io.sort.mb = 100 
    11/11/10 14:42:53 INFO mapred.MapTask: data buffer = 79691776/99614720 
    11/11/10 14:42:53 INFO mapred.MapTask: record buffer = 262144/327680 
    11/11/10 14:42:54 INFO mapred.MapTask: Starting flush of map output 
    11/11/10 14:42:54 INFO mapred.MapTask: Finished spill 0 
    11/11/10 14:42:54 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000003_0 is done. And is in the process of commiting 
    11/11/10 14:42:54 INFO mapred.LocalJobRunner:  
    11/11/10 14:42:54 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000003_0' done. 
    11/11/10 14:42:54 INFO mapred.LocalJobRunner:  
    11/11/10 14:42:54 INFO mapred.Merger: Merging 4 sorted segments 
    11/11/10 14:42:54 INFO mapred.Merger: Down to the last merge-pass, with 4 segments left of total size: 2980 bytes 
    11/11/10 14:42:54 INFO mapred.LocalJobRunner:  
    11/11/10 14:42:55 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting 
    11/11/10 14:42:55 INFO mapred.LocalJobRunner:  
    11/11/10 14:42:55 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now 
    11/11/10 14:42:55 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://an.local:9100/user/an/out2 
    11/11/10 14:42:55 INFO mapred.LocalJobRunner: reduce > reduce 
    11/11/10 14:42:55 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done. 
    11/11/10 14:42:55 INFO mapred.JobClient:  map 100% reduce 100% 
    11/11/10 14:42:55 INFO mapred.JobClient: Job complete: job_local_0001 
    11/11/10 14:42:55 INFO mapred.JobClient: Counters: 14 
    11/11/10 14:42:55 INFO mapred.JobClient:   FileSystemCounters 
    11/11/10 14:42:55 INFO mapred.JobClient:     FILE_BYTES_READ=86081 
    11/11/10 14:42:55 INFO mapred.JobClient:     HDFS_BYTES_READ=40373 
    11/11/10 14:42:55 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=181846 
    11/11/10 14:42:55 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2642 
    11/11/10 14:42:55 INFO mapred.JobClient:   Map-Reduce Framework 
    11/11/10 14:42:55 INFO mapred.JobClient:     Reduce input groups=165 
    11/11/10 14:42:55 INFO mapred.JobClient:     Combine output records=0 
    11/11/10 14:42:55 INFO mapred.JobClient:     Map input records=165 
    11/11/10 14:42:55 INFO mapred.JobClient:     Reduce shuffle bytes=0 
    11/11/10 14:42:55 INFO mapred.JobClient:     Reduce output records=165 
    11/11/10 14:42:55 INFO mapred.JobClient:     Spilled Records=330 
    11/11/10 14:42:55 INFO mapred.JobClient:     Map output bytes=2642 
    11/11/10 14:42:55 INFO mapred.JobClient:     Combine input records=0 
    11/11/10 14:42:55 INFO mapred.JobClient:     Map output records=165 
    11/11/10 14:42:55 INFO mapred.JobClient:     Reduce input records=165 

Comparing the two, the second log contains four blocks like this one:


    11/11/10 14:42:42 INFO mapred.MapTask: io.sort.mb = 100 
    11/11/10 14:42:43 INFO mapred.JobClient:  map 0% reduce 0% 
    11/11/10 14:42:46 INFO mapred.MapTask: data buffer = 79691776/99614720 
    11/11/10 14:42:46 INFO mapred.MapTask: record buffer = 262144/327680 
    11/11/10 14:42:49 INFO mapred.MapTask: Starting flush of map output 
    11/11/10 14:42:49 INFO mapred.MapTask: Finished spill 0 

This shows the input was cut into four splits, so the custom splitting works. It also matches the code: with mySplitSize = 1000 and SPLIT_SLOP = 1.1, the loop keeps emitting 1000-byte splits while more than 1100 bytes remain, so four splits means the input file was somewhere between roughly 3.1 KB and 4.1 KB.

Next problem:

With multiple input files, the output file produced for each input should carry the same name as that input file, just under a different directory.

That is, input files and output files in one-to-one correspondence (a possible direction is sketched below).
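
A plausible direction, only a sketch and not worked out in this post: a map task can recover the name of the file its split came from via the FileSplit, and MultipleOutputs lets it write under a base name of its own choosing instead of part-m-xxxxx. This assumes the new-API org.apache.hadoop.mapreduce.lib.output.MultipleOutputs is available in your release; NamedOutputMapper is a made-up name:

    package an.hadoop.test;

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Hypothetical sketch: name each output after the input file it came from.
    public class NamedOutputMapper extends Mapper<LongWritable, Text, Text, Text> {
        private MultipleOutputs<Text, Text> mos;
        private String baseName;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<Text, Text>(context);
            // name of the file this split belongs to
            baseName = ((FileSplit) context.getInputSplit()).getPath().getName();
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // write under <output-dir>/<input-file-name>-m-00000 instead of part-m-00000
            mos.write(new Text(baseName), line, baseName);
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }

Each output file then lands in the job output directory under the name of its input file, giving the one-to-one mapping.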
