Hive Custom InputFormat

Tables created by Hive use \001 (Ctrl-A) as the default field delimiter. You can specify a different character with ROW FORMAT DELIMITED FIELDS TERMINATED BY, but that syntax only supports a single character. If your delimiter is a multi-character string, you need to implement a custom InputFormat. This article demonstrates a simple implementation that handles a multi-character delimiter.
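For contrast, the built-in syntax is enough when the delimiter is a single character; a sketch (table and column names are made up for illustration):

```sql
-- Works: single-character delimiter, no custom code needed
CREATE TABLE demo_csv (author STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Does NOT work as intended: with a multi-character string, Hive's
-- LazySimpleSerDe silently uses only the first character ('|') as the delimiter.
-- CREATE TABLE demo_multi (author STRING, url STRING)
-- ROW FORMAT DELIMITED FIELDS TERMINATED BY '|@^_^@|';
```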

[1] Development Environment

  • Hadoop 2.2.0
  • Hive 0.12.0
  • Java 1.6+
  • Mac OSX 10.9.1

[2] Example

1. Prepare the sample data: mydemosplit.txt

The field delimiter is: “|@^_^@|”
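The core trick is a regex replacement that turns the multi-character delimiter into Hive's default \001 separator. A minimal standalone sketch (no Hadoop required; the class name is made up, the sample line is taken from the data file):

```java
// Shows how the delimiter "|@^_^@|" is rewritten into Hive's default
// \001 (Ctrl-A) separator, which the table's SerDe then splits on.
public class DelimiterDemo {

    // '|' and '^' are regex metacharacters, so they must be escaped.
    public static String toCtrlA(String line) {
        return line.toLowerCase().replaceAll("\\|@\\^_\\^@\\|", "\u0001");
    }

    public static void main(String[] args) {
        String line = "michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-metastore-config/";
        // After replacement, splitting on \001 yields the 3 table columns.
        String[] fields = toCtrlA(line).split("\u0001");
        System.out.println(fields.length + " fields, first = " + fields[0]);
    }
}
```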

2. Implement the InputFormat

MyDemoInputFormat.java

package com.micmiu.hive;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

/**
 * Custom InputFormat for Hive, supporting a multi-character
 * field delimiter such as: |@^_^@|
 *
 * @author <a href="http://www.micmiu.com">Michael</a>
 * @create Feb 24, 2014 3:11:16 PM
 * @version 1.0
 */
public class MyDemoInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter)
            throws IOException {
        reporter.setStatus(genericSplit.toString());
        return new MyDemoRecordReader(
                new LineRecordReader(job, (FileSplit) genericSplit));
    }

    public static class MyDemoRecordReader implements
            RecordReader<LongWritable, Text> {

        private final LineRecordReader reader;
        private final Text text;

        public MyDemoRecordReader(LineRecordReader reader) {
            this.reader = reader;
            this.text = reader.createValue();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }

        @Override
        public LongWritable createKey() {
            return reader.createKey();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return reader.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return reader.getProgress();
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            if (reader.next(key, text)) {
                // e.g. michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-metastore-config/
                // Replace the multi-character delimiter with Hive's default \001,
                // escaping the regex metacharacters '|' and '^'.
                String strReplace = text.toString().toLowerCase()
                        .replaceAll("\\|@\\^_\\^@\\|", "\001");
                value.set(strReplace);
                return true;
            }
            return false;
        }
    }
}

PS: You can also implement the InputFormat and RecordReader interfaces from scratch; see Base64TextInputFormat.java in the Hive source for a reference.

After compiling and packaging the class into a jar, copy the jar into the <HIVE_HOME>/lib/ directory, then exit and re-enter the Hive CLI for it to take effect.

3. Create the table and load the data

Create the table, specifying the custom InputFormat via STORED AS:

hive> CREATE TABLE micmiu_blog(author STRING, category STRING, url STRING)
    > STORED AS INPUTFORMAT 'com.micmiu.hive.MyDemoInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
hive> desc micmiu_blog;
OK
author                  string                  None
category                string                  None
url                     string                  None
Time taken: 0.05 seconds, Fetched: 3 row(s)

Load the data file and compare the table contents before and after:

hive> select * from micmiu_blog;
OK
Time taken: 0.033 seconds
hive> LOAD DATA LOCAL INPATH '/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt' OVERWRITE INTO TABLE micmiu_blog;
Copying data from file:/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt
Copying file: file:/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt
Loading data to table default.micmiu_blog
Table default.micmiu_blog stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 601, raw_data_size: 0]
OK
Time taken: 0.197 seconds
hive> select * from micmiu_blog;
OK
...
Time taken: 0.053 seconds, Fetched: 7 row(s)
hive>

The output above shows that the custom multi-character string delimiter works as intended.


Permalink: http://www.micmiu.com/bigdata/hive/hive-inputformat-string/
