3,394 total views, 7 views today
上篇《Greenplum数据加载方式(1) – insert 和 copy》介绍了Greenplum最普通、低效的数据导入方式:insert和copy。
为了提高数据导入效率,Greenplum引入了外部表。外部表基于gpfdist工具(类似于Oracle的sqlldr工具),其最大的优势是支持数据并发加载。
一、外部表介绍及原理所谓外部表,就是在数据库中只有表定义、没有数据,数据都存放在数据库之外的数据文件。greenplum可以对一个外部表执行正常的DML操作,当读取数据的时候,数据库从数据文件中加载数据。外部表支持在segment上并发地告诉从gpfdist导入数据,由于是从segment上导入数据,所以效率很高。
外部表示意图
外部表需要指定gpfdist的IP和端口,还要有详细的目录地址,文件名支持通配符匹配。可以编写多个gpfdist地址,但是总数不能超过总的segment数量,否则会报错。
GPDB提供两种外部表:可读外部表用于数据装载、可写外部表用于数据卸载。外部表可基于文件、亦可基于WEB,这两种都能实现可读、可写。
当一个查询使用一个常规的外部表,该外部表被认为是可重读的,因为在该查询期间数据是静态的。而对于WEB外部表,数据是不可重读的,因为在该查询的执行期间数据可能会发生变化。
可写外部表用以从数据库表中选择记录并输出到文件、命名管道或其他可执行程序。 比如,可以从GPDB中卸载数据并发送到一个可执行程序,该程序连接到其他数据库或者ETL工具并装载数据到其他地方。 可写外部表还可以用于输出到GPDB的并行MapReduce计算。
可写外部表被定义后,数据即可从数据库表中被选择并插入到该可写外部表。 可写外部表只允许INSERT操作 – SELECT、 UPDATE、 DELETE或TRUNCATE是不允许的。可写外部表输出数据到一个可执行程序,该程序要能够接受流输入数据。
在创建外部表的时候,可以指定分隔符、err表、指定允许出错的数据条数,以及源文件的编码等信息。
二、外部表语法
CREATE [READABLE] EXTERNAL TABLE table_name
( column_name data_type [, ...] | LIKE other_table )
LOCATION ('file://seghost[:port]/path/file' [, ...])
| ('gpfdist://filehost[:port]/file_pattern[#transform]'
| ('gpfdists://filehost[:port]/file_pattern[#transform]'
[, ...])
| ('gphdfs://hdfs_host[:port]/path/file')
FORMAT 'TEXT'
[( [HEADER]
[DELIMITER [AS] 'delimiter' | 'OFF']
[NULL [AS] 'null string']
[ESCAPE [AS] 'escape' | 'OFF']
[NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
[FILL MISSING FIELDS] )]
| 'CSV'
[( [HEADER]
[QUOTE [AS] 'quote']
[DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[FORCE NOT NULL column [, ...]]
[ESCAPE [AS] 'escape']
[NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
[FILL MISSING FIELDS] )]
| 'AVRO'
| 'PARQUET'
| 'CUSTOM' (Formatter=)
[ ENCODING 'encoding' ]
[ [LOG ERRORS [INTO error_table]] SEGMENT REJECT LIMIT count
[ROWS | PERCENT] ]
CREATE [READABLE] EXTERNAL WEB TABLE table_name
( column_name data_type [, ...] | LIKE other_table )
LOCATION ('http://webhost[:port]/path/file' [, ...])
| EXECUTE 'command' [ON ALL
| MASTER
| number_of_segments
| HOST ['segment_hostname']
| SEGMENT segment_id ]
FORMAT 'TEXT'
[( [HEADER]
[DELIMITER [AS] 'delimiter' | 'OFF']
[NULL [AS] 'null string']
[ESCAPE [AS] 'escape' | 'OFF']
[NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
[FILL MISSING FIELDS] )]
| 'CSV'
[( [HEADER]
[QUOTE [AS] 'quote']
[DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[FORCE NOT NULL column [, ...]]
[ESCAPE [AS] 'escape']
[NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
[FILL MISSING FIELDS] )]
| 'CUSTOM' (Formatter=)
[ ENCODING 'encoding' ]
[ [LOG ERRORS [INTO error_table]] SEGMENT REJECT LIMIT count
[ROWS | PERCENT] ]
CREATE WRITABLE EXTERNAL TABLE table_name
( column_name data_type [, ...] | LIKE other_table )
LOCATION('gpfdist://outputhost[:port]/filename[#transform]'
| ('gpfdists://outputhost[:port]/file_pattern[#transform]'
[, ...])
| ('gphdfs://hdfs_host[:port]/path')
FORMAT 'TEXT'
[( [DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[ESCAPE [AS] 'escape' | 'OFF'] )]
| 'CSV'
[([QUOTE [AS] 'quote']
[DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[FORCE QUOTE column [, ...]] ]
[ESCAPE [AS] 'escape'] )]
| 'AVRO'
| 'PARQUET'
| 'CUSTOM' (Formatter=)
[ ENCODING 'write_encoding' ]
[ DISTRIBUTED BY (column, [ ... ] ) | DISTRIBUTED RANDOMLY ]
CREATE WRITABLE EXTERNAL WEB TABLE table_name
( column_name data_type [, ...] | LIKE other_table )
EXECUTE 'command' [ON ALL]
FORMAT 'TEXT'
[( [DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[ESCAPE [AS] 'escape' | 'OFF'] )]
| 'CSV'
[([QUOTE [AS] 'quote']
[DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[FORCE QUOTE column [, ...]] ]
[ESCAPE [AS] 'escape'] )]
| 'CUSTOM' (Formatter=)
[ ENCODING 'write_encoding' ]
[ DISTRIBUTED BY (column, [ ... ] ) | DISTRIBUTED RANDOMLY ]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
CREATE[READABLE]EXTERNALTABLEtable_name
(column_namedata_type[,...]|LIKEother_table)
LOCATION('file://seghost[:port]/path/file'[,...])
|('gpfdist://filehost[:port]/file_pattern[#transform]'
|('gpfdists://filehost[:port]/file_pattern[#transform]'
[,...])
|('gphdfs://hdfs_host[:port]/path/file')
FORMAT'TEXT'
[([HEADER]
[DELIMITER[AS]'delimiter'|'OFF']
[NULL[AS]'null string']
[ESCAPE[AS]'escape'|'OFF']
[NEWLINE[AS]'LF'|'CR'|'CRLF']
[FILLMISSINGFIELDS])]
|'CSV'
[([HEADER]
[QUOTE[AS]'quote']
[DELIMITER[AS]'delimiter']
[NULL[AS]'null string']
[FORCENOTNULLcolumn[,...]]
[ESCAPE[AS]'escape']
[NEWLINE[AS]'LF'|'CR'|'CRLF']
[FILLMISSINGFIELDS])]
|'AVRO'
|'PARQUET'
|'CUSTOM'(Formatter=)
[ENCODING'encoding']
[[LOGERRORS[INTOerror_table]]SEGMENTREJECTLIMITcount
[ROWS|PERCENT]]
CREATE[READABLE]EXTERNALWEBTABLEtable_name
(column_namedata_type[,...]|LIKEother_table)
LOCATION('http://webhost[:port]/path/file'[,...])
|EXECUTE'command'[ONALL
|MASTER
|number_of_segments
|HOST['segment_hostname']
|SEGMENTsegment_id]
FORMAT'TEXT'
[([HEADER]
[DELIMITER[AS]'delimiter'|'OFF']
[NULL[AS]'null string']
[ESCAPE[AS]'escape'|'OFF']
[NEWLINE[AS]'LF'|'CR'|'CRLF']
[FILLMISSINGFIELDS])]
|'CSV'
[([HEADER]
[QUOTE[AS]'quote']
[DELIMITER[AS]'delimiter']
[NULL[AS]'null string']
[FORCENOTNULLcolumn[,...]]
[ESCAPE[AS]'escape']
[NEWLINE[AS]'LF'|'CR'|'CRLF']
[FILLMISSINGFIELDS])]
|'CUSTOM'(Formatter=)
[ENCODING'encoding']
[[LOGERRORS[INTOerror_table]]SEGMENTREJECTLIMITcount
[ROWS|PERCENT]]
CREATEWRITABLEEXTERNALTABLEtable_name
(column_namedata_type[,...]|LIKEother_table)
LOCATION('gpfdist://outputhost[:port]/filename[#transform]'
|('gpfdists://outputhost[:port]/file_pattern[#transform]'
[,...])
|('gphdfs://hdfs_host[:port]/path')
FORMAT'TEXT'
[([DELIMITER[AS]'delimiter']
[NULL[AS]'null string']
[ESCAPE[AS]'escape'|'OFF'])]
|'CSV'
[([QUOTE[AS]'quote']
[DELIMITER[AS]'delimiter']
[NULL[AS]'null string']
[FORCEQUOTEcolumn[,...]]]
[ESCAPE[AS]'escape'])]
|'AVRO'
|'PARQUET'
|'CUSTOM'(Formatter=)
[ENCODING'write_encoding']
[DISTRIBUTEDBY(column,[...])|DISTRIBUTEDRANDOMLY]
CREATEWRITABLEEXTERNALWEBTABLEtable_name
(column_namedata_type[,...]|LIKEother_table)
EXECUTE'command'[ONALL]
FORMAT'TEXT'
[([DELIMITER[AS]'delimiter']
[NULL[AS]'null string']
[ESCAPE[AS]'escape'|'OFF'])]
|'CSV'
[([QUOTE[AS]'quote']
[DELIMITER[AS]'delimiter']
[NULL[AS]'null string']
[FORCEQUOTEcolumn[,...]]]
[ESCAPE[AS]'escape'])]
|'CUSTOM'(Formatter=)
[ENCODING'write_encoding']
[DISTRIBUTEDBY(column,[...])|DISTRIBUTEDRANDOMLY]
三、创建外部表的步骤<1>在文件/ETL服务器上启动gpfdist服务器,指定文件目录和端口。
只需将gpfdist工件拷贝到相关服务器上,在环境变量中加上$PATH变量即可使用。
语法:
gpfdist [-d directory] [-p http_port] [-l log_file] [-t timeout]
[-S] [-w time] [-v | -V] [-s] [-m max_length] [--ssl certificate_path]
gpfdist -? | --help
gpfdist --version
1
2
3
4
gpfdist[-ddirectory][-phttp_port][-llog_file][-ttimeout]
[-S][-wtime][-v|-V][-s][-mmax_length][--sslcertificate_path]
gpfdist-?|--help
gpfdist--version
启动gpfdist步骤
$ nohup gpfdist -d /home/gpadmin/script -p 8081 -l /home/gpadmin/script/gpfdist.log &
1
$nohupgpfdist-d/home/gpadmin/script-p8081-l/home/gpadmin/script/gpfdist.log&
查看是否已启动
$ ps -ef|grep gpfdist|grep -v "grep"
gpadmin 13193 1 0 Dec22 ? 00:00:33 gpfdist -d /home/gpadmin/script -p 8081 -l /home/gpadmin/script/gpfdist.log
1
2
$ps-ef|grepgpfdist|grep-v"grep"
gpadmin1319310Dec22?00:00:33gpfdist-d/home/gpadmin/script-p8081-l/home/gpadmin/script/gpfdist.log
停止gpfdist服务
$ kill xxx
1
$killxxx
<2>准备好需要加载的数据文件,创建外部表。这里最后两行(21和22行)是错误数据。
$ more member.txt
10 , jj
11 , kk
2 , tom
3 , cc
4 , dd
5 , ee
6 , ff
7 , gg
8 , hh
9 , ii
10 , jj
11 , kk
2 , tom
3 , cc
4 , dd
5 , ee
6 , ff
7 , gg
8 , hh
9 , ii
10
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
$moremember.txt
10,jj
11,kk
2,tom
3,cc
4,dd
5,ee
6,ff
7,gg
8,hh
9,ii
10,jj
11,kk
2,tom
3,cc
4,dd
5,ee
6,ff
7,gg
8,hh
9,ii
10
<3>创建外部表t1
create external table public.t1
(
id int,
name varchar(200)
)
location ('gpfdist://10.9.15.18:8081/member.txt')
format 'text'
(delimiter ',' null as '' escape 'off')
encoding 'utf8'
log errors into t1_err segment reject limit 3 rows;
1
2
3
4
5
6
7
8
9
10
createexternaltablepublic.t1
(
idint,
namevarchar(200)
)
location('gpfdist://10.9.15.18:8081/member.txt')
format'text'
(delimiter','nullas''escape'off')
encoding'utf8'
logerrorsintot1_errsegmentrejectlimit3rows;
参数说明:
location 文件所在位置,可以直接是本地路径、gpfdist地址、gpfdists地址、gphdfs地址。
format 文本类型
delimiter 分隔符
encoding 编码
log error into 错误数据表,记录错误数据,会自动创建。一般都是tablename_err格式,例如t1_err。
segment reject limit 错误数据的条数/百分比(rows/percent),超过设置值会报错。最小值是2。用来确保数据的完整性。
<4>查询外部表t1
# select * from t1;
NOTICE: Found 2 data formatting errors (2 or more input rows). Rejected related input data.
id | name
----+------
10 | jj
11 | kk
2 | tom
3 | cc
4 | dd
5 | ee
6 | ff
7 | gg
8 | hh
9 | ii
10 | jj
11 | kk
2 | tom
3 | cc
4 | dd
5 | ee
6 | ff
7 | gg
8 | hh
9 | ii
(20 rows)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
# select * from t1;
NOTICE:Found2dataformattingerrors(2ormoreinputrows).Rejectedrelatedinputdata.
id|name
----+------
10|jj
11|kk
2|tom
3|cc
4|dd
5|ee
6|ff
7|gg
8|hh
9|ii
10|jj
11|kk
2|tom
3|cc
4|dd
5|ee
6|ff
7|gg
8|hh
9|ii
(20rows)
查询t1的日志表(t1_err),可以看到该外部表的错误信息:21和22行数据缺失了name列。
Greenplum外部表的错误日志表格式:tablename_err 。
testDB=# select * from t1_err;
cmdtime | relname | filename | linenum | bytenum | errmsg | rawdata | rawbytes
-------------------------------+---------+------------------------------------------------------------------------+---------+---------+--------------------------------+---------+----------
2015-12-23 16:56:41.945765+08 | t1 | gpfdist://10.9.15.18:8081/member.txt [/home/gpadmin/script/member.txt] | 21 | | missing data for column "name" | |
2015-12-23 16:56:41.945765+08 | t1 | gpfdist://10.9.15.18:8081/member.txt [/home/gpadmin/script/member.txt] | 22 | | missing data for column "name" | 10 |
(2 rows)
1
2
3
4
5
6
testDB=# select * from t1_err;
cmdtime|relname|filename|linenum|bytenum|errmsg|rawdata|rawbytes
-------------------------------+---------+------------------------------------------------------------------------+---------+---------+--------------------------------+---------+----------
2015-12-2316:56:41.945765+08|t1|gpfdist://10.9.15.18:8081/member.txt [/home/gpadmin/script/member.txt] | 21 | | missing data for column "name" | |
2015-12-2316:56:41.945765+08|t1|gpfdist://10.9.15.18:8081/member.txt [/home/gpadmin/script/member.txt] | 22 | | missing data for column "name" | 10 |
(2rows)
字段说明:
cmdtime 记录时间
relname 表名
filename 外部表对应的文件名,包含网络路径和绝对路径
linenum 文件行号
errms 错误信息
rawdata 错误数据的值
装载数据,如果外部文件中,数据的格式错误,装载时会提示错误行数。
# insert into public.t1_his select * from t1;
NOTICE: Found 2 data formatting errors (2 or more input rows). Rejected related input data.
1
2
# insert into public.t1_his select * from t1;
NOTICE:Found2dataformattingerrors(2ormoreinputrows).Rejectedrelatedinputdata.