gpfdist读取数据常见报错总结

1.Gpfdist Introduction to use

a. Grammer

-d  <directory>   可以指定工作目录,如果没指定,则为当前目录
-l  <log_file>    指定 log 文件,如果没指定,则直接输出到屏幕中
-p  <http_port>   指定服务端口,默认是 8080
-m  <max_length>  指定最大一行数据的大小,单位是 byte,默认是 32768, 即 32K,可配范围是 32K ~ 256M
-S  <use O_SYNC>  写入文件的时候,同步等待数据写入至存盘后再返回
-v                      显示详细信息
-V                显示更详细信息,当使用这个 V 时,上面的 v 也会被显示出来
-s                    不显示头信息   (这个在 --help 中没有显示,但代码里面是支持的,可以用)
-c                指定一个配置文件,用来执行数据转换的 (这个在 --help 中没有显示,但代码是支持的)
--ssl <certificate_path>   指定 ssl 加密

b. Usage example

nohup gpfdist -d /data/ -p 9236 -m 33554432 -l /root/logs/gpfdist_9236.log &

2.External table

a. Grammer

CREATE READABLE EXTERNAL TABLE table_name     
    ( column_name data_type [, ...] | LIKE other_table )
     LOCATION ('gpfdist://filehost[:port]/file_pattern[#transform=trans_name]'
           [, ...])
     FORMAT 'TEXT' 
           [( [HEADER]
              [DELIMITER [AS] 'delimiter' | 'OFF']
              [NULL [AS] 'null string']
              [ESCAPE [AS] 'escape' | 'OFF']
              [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
              [FILL MISSING FIELDS] )]
          | 'CSV'
           [( [HEADER]
              [QUOTE [AS] 'quote'] 
              [DELIMITER [AS] 'delimiter']
              [NULL [AS] 'null string']
              [FORCE NOT NULL column [, ...]]
              [ESCAPE [AS] 'escape']
              [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
              [FILL MISSING FIELDS] )]
          | 'ORC'
          | 'CUSTOM' (Formatter=<formatter_specifications>)
    [ ENCODING 'encoding' ]
      [ [LOG ERRORS INTO <error_table> SEGMENT REJECT LIMIT <count>
      [ROWS | PERCENT] ]

b. Parameter Explain

NameExplain
HEADERindicates whether the text file contains a header
DELIMITERrepresents the field separator of the file and supports multi-byte separators
NULLmeans that a value in the text file is null
ESCAPErepresents the escape character in the file, which is ‘\’ by default
NEWLINEstands for line break, which currently only supports ‘LF’, ‘CR’, and ‘CRLF’, which can be automatically recognized without specifying

c. Usage example

create table schema.table(col1 int,col2 decimal(10,2),col3 text);
drop external table schema.table_ext;
create readable external table schema.table_ext(like schema.table)
location ('gpfdist://localhost:8080/test/1.csv')
format 'CSV' (DELIMITER E',' QUOTE E'\u0007' NEWLINE 'LF' NULL E'\\N') encoding 'UTF8'
log errors into schema.table_errortable segment reject limit 10 rows;

-bash-4.2$ cat 1.csv 
1,2.5,test"test
2,3.3,\N

3.Gpfdist error handling summary

a. date too long

First, check whether the field separator and line break are correct, and then check whether the maximum size of a single data piece exceeds the size specified by gpfdist -m, and if so, adjust the value of gpfdist -m and restart gpfdist

b. invalid input sytax for type numeric:“”

This scenario occurs in the data file, where "\N" stands for NULL.
In order to automatically parse "\N" in the table, "NULL E'\\N'" is set to the external table, but if the value of numeric corresponding to the foreign table is an empty string, this error will appear.
How to solve it, the external table changes the numeric type to the text type, adds a conversion when reading, and converts the empty string to null and imports it into the internal table.
For example: 
insert into schema.table select col1,cast (nullif(col2,'') as numeric(10,2)),col3 from schema.table_ext;

c. missing date for column “xxxx”

First, check whether the separators and line breaks are correct, and second, check whether the number of fields in the external table is the same as that in the actual data file,Finally,in the error table, view the error data and check whether any escape characters appear at the end of the text and the delimiter recognition fails

d. invalid byte sequence for encoding “UTF8”: 0xb1

Check whether the file encoding is UTF8, and the essential cause of this similar problem is that the file encoding is GB18080, which is inconsistent with the encoding of the actual external table.
To solve the problem, make sure that the created external table supports other non-UTF8 encodings, and then specify it.
There is another possibility, please refer to the number g

e. invalid byte sequence for encoding “UTF8”: 0x00

Oracle accepts strings with '\0' in the middle for storage, and automatically truncates the content when it is displayed on various interfaces.
So the fundamental solution is to truncate the '0' of the field in the program code along with the garbled characters.
For example:
str.trim().split('\u0000')[0];

f. unterminated CSV quoted field

" is the default quotation character in a CSV file. It quotes fields that contains delimiters and cannot be unbalanced。
Use the QUOTE option to specify a different quote character: like QUOTE E'\u0007'

g. The external table is garbled in Chinese

# https://www.cnblogs.com/Mrq-t/p/12795725.html
export LANG=zh_CN
  • 24
    点赞
  • 22
    收藏
    觉得还不错? 一键收藏
  • 2
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值