The default field delimiter for tables created by Hive is \001 (Ctrl-A). You can specify a different character with ROW FORMAT DELIMITED FIELDS TERMINATED BY, but that syntax only supports a single character. If your delimiter consists of multiple characters, you need to implement a custom InputFormat. This article demonstrates a multi-character delimiter with a simple example.
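To make the default concrete: \001 is the ASCII SOH (Ctrl-A) control character, a single byte with code point 1. A minimal sketch of what a Hive row stored with the default delimiter looks like (the field values here are made up for illustration):

```java
public class DefaultDelimiterDemo {
    public static void main(String[] args) {
        // Hive's default field delimiter is "\001" (ASCII SOH, i.e. Ctrl-A)
        String row = "author" + "\001" + "category" + "\001" + "url";
        // The delimiter is a single control character with code point 1
        System.out.println((int) row.charAt(6));      // prints 1
        System.out.println(row.split("\001").length); // prints 3
    }
}
```

Because the delimiter is exactly one character, FIELDS TERMINATED BY can replace it with any other single character, but not with a multi-character sequence like the one used below.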
[1] Development environment
Hadoop 2.2.0, Hive 0.12.0, Java 1.6+, Mac OS X 10.9.1
[2] Example
1. Prepare the sample data: mydemosplit.txt
The field delimiter is "|@^_^@|".
2. Implement the InputFormat
MyDemoInputFormat.java
package com.micmiu.hive;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

/**
 * Hive custom multi-character delimiter, e.g. |@^_^@|
 *
 * @create Feb 24, 2014 3:11:16 PM
 */
public class MyDemoInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter)
            throws IOException {
        reporter.setStatus(genericSplit.toString());
        MyDemoRecordReader reader = new MyDemoRecordReader(
                new LineRecordReader(job, (FileSplit) genericSplit));
        return reader;
    }

    public static class MyDemoRecordReader implements
            RecordReader<LongWritable, Text> {

        LineRecordReader reader;
        Text text;

        public MyDemoRecordReader(LineRecordReader reader) {
            this.reader = reader;
            text = reader.createValue();
        }

        public void close() throws IOException {
            reader.close();
        }

        public LongWritable createKey() {
            return reader.createKey();
        }

        public Text createValue() {
            return new Text();
        }

        public long getPos() throws IOException {
            return reader.getPos();
        }

        public float getProgress() throws IOException {
            return reader.getProgress();
        }

        public boolean next(LongWritable key, Text value) throws IOException {
            while (reader.next(key, text)) {
                // Replace the multi-character delimiter |@^_^@| with Hive's
                // default delimiter \001; '|' and '^' are regex metacharacters
                // and must be escaped in the pattern
                String strReplace = text.toString().toLowerCase()
                        .replaceAll("\\|@\\^_\\^@\\|", "\001");
                Text txtReplace = new Text();
                txtReplace.set(strReplace);
                value.set(txtReplace.getBytes(), 0, txtReplace.getLength());
                return true;
            }
            return false;
        }
    }
}
P.S.: The implementation customizes the InputFormat and RecordReader interfaces; Base64TextInputFormat.java in the Hive source code is a good reference.
After compiling and packaging into a jar, copy the jar into the <HIVE_HOME>/lib/ directory, then exit and re-enter the Hive CLI so the new class is picked up.
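The heart of MyDemoRecordReader.next() is the replaceAll call: each line read by the wrapped LineRecordReader has the multi-character delimiter rewritten to \001 before Hive sees it. A self-contained sketch of that substitution (the sample line below is hypothetical, since the contents of mydemosplit.txt are not reproduced here):

```java
public class DelimiterReplaceDemo {
    public static void main(String[] args) {
        // Hypothetical input line using the multi-character delimiter |@^_^@|
        String line = "Michael|@^_^@|Hive|@^_^@|http://www.micmiu.com";
        // '|' and '^' are regex metacharacters, so they are escaped in the
        // pattern; the replacement "\001" is Hive's default field delimiter
        String replaced = line.toLowerCase().replaceAll("\\|@\\^_\\^@\\|", "\001");
        System.out.println(replaced.split("\001").length); // prints 3
        System.out.println(replaced.indexOf('|'));         // prints -1: fully replaced
    }
}
```

Once every delimiter is rewritten to \001, Hive's normal text SerDe splits the fields with no further configuration.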
3. Create the table and load the data
Create the table, using the STORED AS clause to specify the custom InputFormat:
hive> CREATE TABLE micmiu_blog(author STRING, category STRING, url STRING)
    > STORED AS INPUTFORMAT 'com.micmiu.hive.MyDemoInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
OK
Time taken: 0.05 seconds, Fetched: 3 row(s)
Load the data file above, and compare the table contents before and after the load:
hive> select * from micmiu_blog;
OK
Time taken: 0.033 seconds
hive> LOAD DATA LOCAL INPATH '/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt' OVERWRITE INTO TABLE micmiu_blog;
Copying data from file:/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt
Copying file: file:/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt
Loading data to table default.micmiu_blog
Table default.micmiu_blog stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 601, raw_data_size: 0]
OK
Time taken: 0.197 seconds
hive> select * from micmiu_blog;
OK
... (7 data rows, omitted here)
Time taken: 0.053 seconds, Fetched: 7 row(s)
The session above shows that the custom multi-character delimiter is now handled correctly.