Custom InputFormat in Hive

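By default, Hive creates tables with \001 (Ctrl-A) as the field delimiter. You can specify a different character with ROW FORMAT DELIMITED FIELDS TERMINATED BY, but that syntax supports only a single character. If your delimiter is several characters long, you need to implement a custom InputFormat. This post walks through a simple example that uses a multi-character delimiter.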

[1] Development environment

Hadoop 2.2.0
Hive 0.12.0
Java 1.6+
Mac OS X 10.9.1

[2] Example

1. Prepare the demo data: mydemosplit.txt

michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-metastore-config/
michael|@^_^@|j2ee|@^_^@|http://www.micmiu.com/j2ee/hibernate/hibernate-jpa-demo/
michael|@^_^@|groovy|@^_^@|http://www.micmiu.com/lang/groovy/groovy-running-ways/
michael|@^_^@|sso|@^_^@|http://www.micmiu.com/enterprise-app/sso/sso-cas-ldap-auth/
michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-tutorial-ddl-dml/
michael|@^_^@|j2ee|@^_^@|http://www.micmiu.com/j2ee/spring/springmvc-binding-date/
michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hadoop2x-cluster-setup/

The delimiter is: "|@^_^@|"
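To make the goal concrete before bringing in any Hadoop classes: each line has to end up with Hive's default \001 delimiter in place of the multi-character one, so that Hive can split it into three fields. The following is a minimal plain-Java sketch of that conversion. The class name DelimiterDemo and its methods are hypothetical and are not part of the example project.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, used only to illustrate the delimiter conversion.
class DelimiterDemo {
    // Multi-character delimiter used in mydemosplit.txt.
    static final String CUSTOM_DELIMITER = "|@^_^@|";
    // Hive's default field delimiter, \001 (Ctrl-A).
    static final String HIVE_DELIMITER = "\001";

    // Rewrite one raw line so that it uses Hive's default delimiter.
    static String toHiveLine(String rawLine) {
        // String.replace treats the delimiter literally, so no regex escaping is needed.
        return rawLine.replace(CUSTOM_DELIMITER, HIVE_DELIMITER);
    }

    // Split a rewritten line into its fields; the -1 limit keeps trailing empty fields.
    static List<String> fields(String hiveLine) {
        return Arrays.asList(hiveLine.split(HIVE_DELIMITER, -1));
    }

    public static void main(String[] args) {
        String raw = "michael|@^_^@|hadoop|@^_^@|"
                + "http://www.micmiu.com/opensource/hadoop/hive-metastore-config/";
        // Prints the three fields: author, category, url.
        System.out.println(fields(toHiveLine(raw)));
    }
}
```

The custom InputFormat in the next step performs the same substitution, but inside a RecordReader so that it is applied to every line Hive reads from the file.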

2. Implement the InputFormat

MyDemoInputFormat.java

package com.micmiu.hive;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
 
/**
 * Custom multi-character field delimiter for Hive, e.g. |@^_^@|
 *
 * @author <a href="http://www.micmiu.com">Michael</a>
 * @create Feb 24, 2014 3:11:16 PM
 * @version 1.0
 */
public class MyDemoInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter)
            throws IOException {
        reporter.setStatus(genericSplit.toString());
        // Wrap the stock line reader so each line can be rewritten before Hive sees it.
        MyDemoRecordReader reader = new MyDemoRecordReader(
                new LineRecordReader(job, (FileSplit) genericSplit));
        return reader;
    }
 
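    public static class MyDemoRecordReader implements
            RecordReader<LongWritable, Text> {

        // Underlying reader that does the actual line-by-line reading.
        LineRecordReader reader;
        // Reusable buffer for the raw line read from the underlying reader.
        Text text;

        public MyDemoRecordReader(LineRecordReader reader) {
            this.reader = reader;
            text = reader.createValue();
        }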
 
        @Override
        public void close() throws IOException {
            reader.close();
        }

        @Override
        public LongWritable createKey() {
            return reader.createKey();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return reader.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return reader.getProgress();
        }
 
        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            // Read one raw line from the wrapped reader; false means the split is exhausted.
            if (!reader.next(key, text)) {
                return false;
            }
            // e.g. michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-metastore-config/
            // Replace the custom delimiter with Hive's default \001. The regex escapes
            // '|' and '^'. Note that toLowerCase() also lowercases the field values.
            String strReplace = text.toString().toLowerCase()
                    .replaceAll("\\|@\\^_\\^@\\|", "\001");
            value.set(strReplace);
            return true;
        }
    }
}
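The delimiter contains | and ^, both of which are regular-expression metacharacters, which is why the pattern passed to replaceAll escapes each of them. The sketch below is only an illustration and is not part of the original class (DelimiterRegexDemo and its method names are hypothetical). It compares the hand-escaped pattern with Pattern.quote and shows that the unescaped delimiter does not give the intended result.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate regex escaping.
class DelimiterRegexDemo {
    static final String CUSTOM_DELIMITER = "|@^_^@|";

    // Pattern.quote makes the regex engine treat every character literally.
    static final Pattern QUOTED = Pattern.compile(Pattern.quote(CUSTOM_DELIMITER));

    // Same hand-escaped pattern as in MyDemoRecordReader.next().
    static String withHandEscapedRegex(String line) {
        return line.replaceAll("\\|@\\^_\\^@\\|", "\001");
    }

    // Equivalent result without writing the escapes by hand.
    static String withQuotedPattern(String line) {
        return QUOTED.matcher(line).replaceAll("\001");
    }

    // Unescaped: '|' is read as alternation, so the pattern also matches the
    // empty string and the replacement is inserted at every position.
    static String withUnescapedRegex(String line) {
        return line.replaceAll(CUSTOM_DELIMITER, "\001");
    }

    public static void main(String[] args) {
        String line = "michael|@^_^@|hadoop";
        String expected = "michael\001hadoop";
        System.out.println(withHandEscapedRegex(line).equals(expected)); // true
        System.out.println(withQuotedPattern(line).equals(expected));    // true
        System.out.println(withUnescapedRegex(line).equals(expected));   // false
    }
}
```

If the delimiter ever changes, quoting it programmatically avoids having to rework the escapes by hand.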

P.S. For a reference on implementing the InputFormat and RecordReader interfaces yourself, see Base64TextInputFormat.java in the Hive source code.
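The class above follows a wrapper (delegate) pattern: the custom reader hands the actual reading to an existing line reader and only rewrites each value before returning it. As a rough, Hadoop-free illustration of that idea, the sketch below wraps a java.io.BufferedReader instead of LineRecordReader. DelimiterRewritingReader and its methods are hypothetical names, the code assumes Java 8 or later, and unlike the original class it does not lowercase the text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for MyDemoRecordReader that needs no Hadoop classes.
class DelimiterRewritingReader implements AutoCloseable {
    private static final String CUSTOM_DELIMITER = "|@^_^@|";

    // The wrapped reader does the real work of reading lines.
    private final BufferedReader delegate;

    DelimiterRewritingReader(BufferedReader delegate) {
        this.delegate = delegate;
    }

    // Returns the next line with the custom delimiter replaced by \001,
    // or null once the input is exhausted.
    String nextLine() throws IOException {
        String line = delegate.readLine();
        return line == null ? null : line.replace(CUSTOM_DELIMITER, "\001");
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }

    // Reads every row and renders it tab-separated for display.
    static List<String> renderRows(String fileContents) {
        List<String> rows = new ArrayList<>();
        try (DelimiterRewritingReader reader = new DelimiterRewritingReader(
                new BufferedReader(new StringReader(fileContents)))) {
            String line;
            while ((line = reader.nextLine()) != null) {
                rows.add(line.replace('\001', '\t'));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String sample =
                "michael|@^_^@|sso|@^_^@|http://www.micmiu.com/enterprise-app/sso/sso-cas-ldap-auth/\n"
                + "michael|@^_^@|j2ee|@^_^@|http://www.micmiu.com/j2ee/spring/springmvc-binding-date/\n";
        for (String row : renderRows(sample)) {
            System.out.println(row);
        }
    }
}
```

Keeping the rewriting step separate from the reading step is what lets the Hive version reuse LineRecordReader unchanged, including its handling of split boundaries.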

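After compiling the class and packaging it into a jar, copy the jar to the <HOME_HIVE>/lib/ directory. You then need to exit and re-enter the Hive CLI for it to be picked up.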

3. Create the table and load the data

Create the table with the STORED AS file_format clause, here in its INPUTFORMAT ... OUTPUTFORMAT ... form:

hive> CREATE TABLE micmiu_blog(author STRING, category STRING, url STRING) STORED AS INPUTFORMAT 'com.micmiu.hive.MyDemoInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
hive> desc micmiu_blog;
OK
author                  string                  None
category                string                  None
url                     string                  None
Time taken: 0.05 seconds, Fetched: 3 row(s)

Load the data file above and compare the table contents before and after the load:

hive> select * from micmiu_blog;
OK
Time taken: 0.033 seconds
hive> LOAD DATA LOCAL INPATH '/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt' OVERWRITE INTO TABLE micmiu_blog;
Copying data from file:/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt
Copying file: file:/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt
Loading data to table default.micmiu_blog
Table default.micmiu_blog stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 601, raw_data_size: 0]
OK
Time taken: 0.197 seconds
hive> select * from micmiu_blog;
OK
michael  hadoop  http://www.micmiu.com/opensource/hadoop/hive-metastore-config/
michael  j2ee    http://www.micmiu.com/j2ee/hibernate/hibernate-jpa-demo/
michael  groovy  http://www.micmiu.com/lang/groovy/groovy-running-ways/
michael  sso     http://www.micmiu.com/enterprise-app/sso/sso-cas-ldap-auth/
michael  hadoop  http://www.micmiu.com/opensource/hadoop/hive-tutorial-ddl-dml/
michael  j2ee    http://www.micmiu.com/j2ee/spring/springmvc-binding-date/
michael  hadoop  http://www.micmiu.com/opensource/hadoop/hadoop2x-cluster-setup/
Time taken: 0.053 seconds, Fetched: 7 row(s)
hive>