How to Extend Hadoop's InputFormat to Use a Different Record Delimiter

Latest recommended article published on 2021-08-09 15:37:11

追寻北极

In Hadoop, the commonly used TextInputFormat treats the newline character as the record delimiter.

In practice, a single record often spans several lines, for example a block delimited by <doc> and </doc> tags.

In that case, TextInputFormat has to be extended to support this.
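To make the goal concrete, here is a minimal plain-Java sketch with no Hadoop dependency. It only illustrates the effect of changing the record delimiter. The class `DelimiterSplitSketch`, its methods, and the sample input are hypothetical names made up for this illustration. They are not Hadoop's actual implementation, which also has to deal with input splits and buffering.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class DelimiterSplitSketch {

    // Splits data into records at each occurrence of delimiter.
    // The delimiter itself is not included in the returned records.
    static List<String> splitRecords(byte[] data, byte[] delimiter) {
        if (delimiter.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i <= data.length - delimiter.length) {
            if (matchesAt(data, i, delimiter)) {
                records.add(new String(data, start, i - start, StandardCharsets.UTF_8));
                i += delimiter.length;
                start = i;
            } else {
                i++;
            }
        }
        // Trailing bytes without a closing delimiter form a final record.
        if (start < data.length) {
            records.add(new String(data, start, data.length - start, StandardCharsets.UTF_8));
        }
        return records;
    }

    // Returns true if delimiter occurs in data starting at offset.
    static boolean matchesAt(byte[] data, int offset, byte[] delimiter) {
        for (int j = 0; j < delimiter.length; j++) {
            if (data[offset + j] != delimiter[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String input = "<doc>\nline one\nline two\n</doc>\n<doc>\nline three\n</doc>\n";
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        List<String> byNewline = splitRecords(bytes, "\n".getBytes(StandardCharsets.UTF_8));
        List<String> byDoc = splitRecords(bytes, "</doc>\n".getBytes(StandardCharsets.UTF_8));
        System.out.println("newline delimiter: " + byNewline.size() + " records");
        System.out.println("</doc> delimiter: " + byDoc.size() + " records");
    }
}
```

With the newline delimiter every physical line becomes its own record, while with `"</doc>\n"` each multi-line block arrives as a single record.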

First, let's look at the original implementation:

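```java
package org.apache.hadoop.mapreduce.lib.input;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // When textinputformat.record.delimiter is unset, the delimiter stays
        // null and LineRecordReader falls back to splitting on line endings.
        String delimiter = context.getConfiguration().get(
                "textinputformat.record.delimiter");
        byte[] recordDelimiterBytes = null;
        if (null != delimiter) {
            recordDelimiterBytes = delimiter.getBytes();
        }
        return new LineRecordReader(recordDelimiterBytes);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Only uncompressed files are treated as splittable.
        CompressionCodec codec =
                new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        return codec == null;
    }
}
```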

From the code above, it is easy to see that the record delimiter is actually determined by the configuration key "textinputformat.record.delimiter".

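So there are two possible solutions:
(1) Set textinputformat.record.delimiter to "</doc>\n" directly in the Job configuration. This approach is rather hacky and can easily interfere with other code that relies on the default behavior.
(2) Subclass TextInputFormat and pass a custom delimiter when returning the LineRecordReader.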
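For reference, option (1) could look roughly like the configuration fragment below. This is only a sketch: the class name `DelimiterByConfigSketch`, the helper `newJob`, and the job name are hypothetical, and compiling it requires the Hadoop client libraries on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class DelimiterByConfigSketch {

    // Hypothetical helper: builds a Job whose stock TextInputFormat
    // splits records on "</doc>\n" instead of on line endings.
    static Job newJob() throws IOException {
        Configuration conf = new Configuration();
        // Set the key before creating the Job, because the Job works on
        // its own copy of the Configuration.
        conf.set("textinputformat.record.delimiter", "</doc>\n");
        Job job = Job.getInstance(conf, "doc-delimiter-by-config");
        job.setInputFormatClass(TextInputFormat.class);
        return job;
    }
}
```

Because the key lives in the job-wide configuration, every component of the job that reads it sees the same delimiter, which is where the risk of side effects comes from.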

This article takes the second approach. The code is as follows:

 
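```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class DocInputFormat extends TextInputFormat {

    private static final String RECORD_DELIMITER = "</doc>\n";

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext tac) {
        // Always use the fixed delimiter, ignoring the job configuration.
        byte[] recordDelimiterBytes = RECORD_DELIMITER.getBytes(StandardCharsets.UTF_8);
        return new LineRecordReader(recordDelimiterBytes);
    }

    @Override
    public boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec =
                new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        return codec == null;
    }
}
```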
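A job would then select this input format in its driver. The sketch below shows one possible way to wire it up. `DocJobDriver`, `DocMapper`, the job name, and the use of `args[0]` and `args[1]` as input and output paths are all assumptions made for illustration, and the code needs the Hadoop client libraries plus the `DocInputFormat` class above in order to compile.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DocJobDriver {

    // Hypothetical mapper: each input value is one whole multi-line record,
    // with the "</doc>\n" delimiter already stripped by the record reader.
    public static class DocMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text doc, Context context)
                throws IOException, InterruptedException {
            context.write(offset, doc);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "doc-input-format-demo");
        job.setJarByClass(DocJobDriver.class);
        // Use the custom input format instead of the default TextInputFormat.
        job.setInputFormatClass(DocInputFormat.class);
        job.setMapperClass(DocMapper.class);
        job.setNumReduceTasks(0); // map-only job, enough for a demo
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Since the delimiter is fixed inside `DocInputFormat`, the job-wide configuration is left untouched and other code that reads `textinputformat.record.delimiter` keeps its default behavior.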