Reposted from: http://www.micmiu.com/bigdata/hive/hive-inputformat-string/
Hive's default field delimiter is \001 (Ctrl-A). You can specify a different character with ROW FORMAT DELIMITED FIELDS TERMINATED BY, but that syntax only accepts a single character. If your delimiter is a multi-character string, you need to implement a custom InputFormat. This post walks through a simple example that uses a multi-character string as the field delimiter.
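For comparison, the built-in single-character syntax looks like this (an illustrative sketch; the table name and tab delimiter are my own example, not from the original post):

hive> CREATE TABLE demo_tab(col1 STRING, col2 STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

A multi-character string such as '|@^_^@|' cannot be expressed this way, which is what the custom InputFormat below works around.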
[1] Development environment
Hadoop 2.2.0
Hive 0.12.0
Java 1.6+
Mac OSX 10.9.1
[2] Example
1. Prepare the demo data: mydemosplit.txt
The field delimiter is the string “|@^_^@|”.
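The repost does not include the contents of mydemosplit.txt. A hypothetical row matching the (author, category, url) schema created below might look like:

micmiu|@^_^@|hive|@^_^@|http://www.micmiu.com/bigdata/hive/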
2. Implement the custom InputFormat
MyDemoInputFormat.java
package com.micmiu.hive;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

/**
 * Custom Hive InputFormat for a multi-character delimiter, e.g. |@^_^@|
 *
 * @create Feb 24, 2014 3:11:16 PM
 * @version 1.0
 */
public class MyDemoInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter)
            throws IOException {
        reporter.setStatus(genericSplit.toString());
        // Wrap the standard line reader so each line can be rewritten
        // before Hive sees it.
        MyDemoRecordReader reader = new MyDemoRecordReader(
                new LineRecordReader(job, (FileSplit) genericSplit));
        return reader;
    }

    public static class MyDemoRecordReader implements
            RecordReader<LongWritable, Text> {

        LineRecordReader reader;
        Text text;

        public MyDemoRecordReader(LineRecordReader reader) {
            this.reader = reader;
            text = reader.createValue();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }

        @Override
        public LongWritable createKey() {
            return reader.createKey();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return reader.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return reader.getProgress();
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            while (reader.next(key, text)) {
                // Replace the multi-character delimiter with Hive's default
                // \001 so the default SerDe can split the fields. Note that
                // toLowerCase() also lowercases the field values themselves,
                // a side effect carried over from the original example.
                String strReplace = text.toString().toLowerCase()
                        .replaceAll("\\|@\\^_\\^@\\|", "\001");
                Text txtReplace = new Text();
                txtReplace.set(strReplace);
                value.set(txtReplace.getBytes(), 0, txtReplace.getLength());
                return true;
            }
            return false;
        }
    }
}
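A quick standalone check of the regex escaping used in next() (my own snippet, not from the original post; the sample line is hypothetical):

public class SplitCheck {
    public static void main(String[] args) {
        String line = "micmiu|@^_^@|hive|@^_^@|http://www.micmiu.com";
        // '|' and '^' are regex metacharacters, so they must be escaped
        // in the pattern passed to replaceAll().
        String replaced = line.toLowerCase().replaceAll("\\|@\\^_\\^@\\|", "\001");
        System.out.println(replaced.split("\001").length); // prints 3
    }
}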
PS: For details on implementing the InputFormat and RecordReader interfaces, see Base64TextInputFormat.java in the Hive source code.
After compiling and packaging the class into a jar, copy the jar into Hive's lib/ directory, then exit and re-enter the Hive CLI so the class is picked up.
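A minimal build-and-deploy sketch (assuming a configured Hadoop client and $HIVE_HOME; the jar name mydemo-inputformat.jar is my own choice, not from the original post):

# compile against the Hadoop client classpath
javac -cp "$(hadoop classpath)" -d classes com/micmiu/hive/MyDemoInputFormat.java
# package and drop into Hive's lib directory
jar cf mydemo-inputformat.jar -C classes .
cp mydemo-inputformat.jar $HIVE_HOME/lib/

Alternatively, running ADD JAR /path/to/mydemo-inputformat.jar inside the Hive CLI registers the class for the current session only.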
3. Create the table and load the data
Create the table using STORED AS with explicit INPUTFORMAT and OUTPUTFORMAT classes:

hive> CREATE TABLE micmiu_blog(author STRING, category STRING, url STRING)
    > STORED AS INPUTFORMAT 'com.micmiu.hive.MyDemoInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
hive> desc micmiu_blog;
OK
author      string      None
category    string      None
url         string      None
Time taken: 0.05 seconds, Fetched: 3 row(s)
Load the data file above, then compare the table contents before and after the load:

hive> select * from micmiu_blog;
OK
Time taken: 0.033 seconds
hive> LOAD DATA LOCAL INPATH '/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt' OVERWRITE INTO TABLE micmiu_blog;
Copying data from file:/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt
Copying file: file:/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt
Loading data to table default.micmiu_blog
Table default.micmiu_blog stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 601, raw_data_size: 0]
OK
Time taken: 0.197 seconds
hive> select * from micmiu_blog;
OK
...
Time taken: 0.053 seconds, Fetched: 7 row(s)
hive>
As the session above shows, the custom multi-character string now works as the field delimiter.
In summary, this post showed how to support a multi-character field delimiter in Hive, using '|@^_^@|' as an example: a custom InputFormat (`MyDemoInputFormat` with its `MyDemoRecordReader`) rewrites the delimiter to \001 as data is read. The steps covered the development environment, the custom classes, table creation, and data loading, with a before/after comparison of the table contents.