Hive's default field delimiter for new tables is \001 (Ctrl-A). A different delimiter can be specified with ROW FORMAT DELIMITED FIELDS TERMINATED BY, but that clause only supports a single character. If your delimiter is a multi-character string, you have to implement a custom InputFormat; this post demonstrates that with a simple example.
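For comparison, the built-in clause is all you need when the delimiter is one character. A minimal sketch (the table name and delimiter here are placeholders, not from the original post):

CREATE TABLE demo_single (author STRING, category STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Since this clause accepts only a single character, a string like |@^_^@| cannot be declared this way, which is where the custom InputFormat below comes in.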
[1] Development Environment
- Hadoop 2.2.0
- Hive 0.12.0
- Java 1.6+
- Mac OS X 10.9.1
[2] Example
1. Prepare the sample data: mydemosplit.txt
michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-metastore-config/
michael|@^_^@|j2ee|@^_^@|http://www.micmiu.com/j2ee/hibernate/hibernate-jpa-demo/
michael|@^_^@|groovy|@^_^@|http://www.micmiu.com/lang/groovy/groovy-running-ways/
michael|@^_^@|sso|@^_^@|http://www.micmiu.com/enterprise-app/sso/sso-cas-ldap-auth/
michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-tutorial-ddl-dml/
michael|@^_^@|j2ee|@^_^@|http://www.micmiu.com/j2ee/spring/springmvc-binding-date/
michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hadoop2x-cluster-setup/
The field delimiter is "|@^_^@|".
2. Implement the custom InputFormat
MyDemoInputFormat.java:
package com.micmiu.hive;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

/**
 * Custom Hive InputFormat supporting a multi-character field delimiter,
 * e.g. |@^_^@|
 *
 * @author <a href="http://www.micmiu.com">Michael</a>
 * @create Feb 24, 2014 3:11:16 PM
 * @version 1.0
 */
public class MyDemoInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter)
            throws IOException {
        reporter.setStatus(genericSplit.toString());
        // Wrap the stock line-oriented reader so every line can be
        // rewritten before Hive parses it.
        return new MyDemoRecordReader(
                new LineRecordReader(job, (FileSplit) genericSplit));
    }

    public static class MyDemoRecordReader implements
            RecordReader<LongWritable, Text> {

        LineRecordReader reader;
        Text text;

        public MyDemoRecordReader(LineRecordReader reader) {
            this.reader = reader;
            text = reader.createValue();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }

        @Override
        public LongWritable createKey() {
            return reader.createKey();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return reader.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return reader.getProgress();
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            if (reader.next(key, text)) {
                // e.g. michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-metastore-config/
                // '|' and '^' are regex metacharacters, so the delimiter must
                // be escaped. It is replaced with \001, Hive's default field
                // separator; toLowerCase() also normalizes the whole line,
                // matching the original sample output.
                String strReplace = text.toString().toLowerCase()
                        .replaceAll("\\|@\\^_\\^@\\|", "\001");
                value.set(strReplace);
                return true;
            }
            return false;
        }
    }
}
PS: For a reference implementation of the InputFormat and RecordReader interfaces, see Base64TextInputFormat.java in the Hive source code.
After compiling and packaging the class into a jar, copy the jar into <HIVE_HOME>/lib/; you then need to exit and re-enter the Hive CLI for it to be picked up.
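Alternatively (not covered in the original post), the jar can be registered for the current session only with ADD JAR, which avoids touching <HIVE_HOME>/lib/; the path below is a placeholder:

hive> ADD JAR /path/to/mydemo-inputformat.jar;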
3. Create the table and load the data
The custom class is specified with the STORED AS file_format clause of CREATE TABLE.
Create the table:
hive> CREATE TABLE micmiu_blog(author STRING, category STRING, url STRING) STORED AS INPUTFORMAT 'com.micmiu.hive.MyDemoInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
hive> desc micmiu_blog;
OK
author      string  None
category    string  None
url         string  None
Time taken: 0.05 seconds, Fetched: 3 row(s)
Load the data file shown above, then compare the table contents before and after the load:
hive> select * from micmiu_blog;
OK
Time taken: 0.033 seconds
hive> LOAD DATA LOCAL INPATH '/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt' OVERWRITE INTO TABLE micmiu_blog;
Copying data from file:/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt
Copying file: file:/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt
Loading data to table default.micmiu_blog
Table default.micmiu_blog stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 601, raw_data_size: 0]
OK
Time taken: 0.197 seconds
hive> select * from micmiu_blog;
OK
michael hadoop  http://www.micmiu.com/opensource/hadoop/hive-metastore-config/
michael j2ee    http://www.micmiu.com/j2ee/hibernate/hibernate-jpa-demo/
michael groovy  http://www.micmiu.com/lang/groovy/groovy-running-ways/
michael sso     http://www.micmiu.com/enterprise-app/sso/sso-cas-ldap-auth/
michael hadoop  http://www.micmiu.com/opensource/hadoop/hive-tutorial-ddl-dml/
michael j2ee    http://www.micmiu.com/j2ee/spring/springmvc-binding-date/
michael hadoop  http://www.micmiu.com/opensource/hadoop/hadoop2x-cluster-setup/
Time taken: 0.053 seconds, Fetched: 7 row(s)
hive>
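As an extra sanity check that the columns are really split (this query is illustrative and was not part of the original run):

hive> SELECT category, COUNT(*) FROM micmiu_blog GROUP BY category;

Given the seven sample rows, it should report groovy 1, hadoop 3, j2ee 2 and sso 1.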
The run above shows that a custom multi-character string now works as the field delimiter.
Permalink: http://www.micmiu.com/bigdata/hive/hive-inputformat-string/