gpss安装
主要有以下三种方式:
- GPSS gppkg Installer - 升级Greenplum数据库集群中所有节点上的GPSS 。
- GPSS ETL Installer - 在没有安装Greenplum数据库的专用ETL服务器主机上安装或升级GPSS 。
- GPSS tarball - 在一个已经安装Greenplum数据库的专用ETL服务器主机上安装或升级GPSS
这里我选的是第二种:
GPSS版本:gpss-gpdb6-1.4.2-rhel6-x86_64.rpm
安装方式:GPSS ETL
安装机器:etlHost
目标数据库:gpHost
目标数据库版本:greenplum-db-6.7.0-rhel6-x86_64
Kafka主机:kafkaHost
Kafka版本:kafka_2.11-2.2.0
系统版本:CentOS6.5(全部)
安装包下载:
https://network.pivotal.io/products/greenplum-streaming-server#/releases/770954/file_groups/2939
将安装包下载到etlHost主机,执行安装:
yum install gpss-gpdb6-1.4.2-rhel6-x86_64.rpm
注:GPSS ETL工具会被安装到/usr/local/gpss-<version>目录,同时会创建一个名为 /usr/local/gpss的软链接到当前目录。
初始化环境:
source /usr/local/gpss/gpss_path.sh
测试是否安装成功:
gpss –help
gpss command starts a greenplum stream server that
include a gpfdist sever and a rpc server
Usage:
gpss <config file> [flags]
Flags:
--debug-port string enable pprof debug server at specified port
-h, --help help for gpss
-l, --log-dir string log directory, default is $HOME/gpAdminLogs
--verbose enable debug log
--version version for gpss
登录到kafkaHost机器,并进切换到kafka所在目录:
cd /home/app/kafka_2.11-2.2.0/
在Kafka中创建名为topic_json_gpkafka的topic:
./bin/kafka-topics.sh --create \
--zookeeper localhost:2181 --replication-factor 1 --partitions 1 \
--topic topic_json_gpkafka
编辑并保存文本sample_data.json:
vi /home/app/public/sample_data.json
{ "cust_id": 1313131, "month": 12, "expenses": 1313.13 }
{ "cust_id": 3535353, "month": 11, "expenses": 761.35 }
{ "cust_id": 7979797, "month": 10, "expenses": 4489.00 }
{ "cust_id": 7979797, "month": 11, "expenses": 18.72 }
{ "cust_id": 3535353, "month": 10, "expenses": 6001.94 }
{ "cust_id": 7979797, "month": 12, "expenses": 173.18 }
{ "cust_id": 1313131, "month": 10, "expenses": 492.83 }
{ "cust_id": 3535353, "month": 12, "expenses": 81.12 }
{ "cust_id": 1313131, "month": 11, "expenses": 368.27 }
将sample_data.json文件的内容发布到Kafka主题topic_json_gpkafka:
./bin/kafka-console-producer.sh \
--broker-list kafkaHost:9092 \
--topic topic_json_gpkafka < /home/app/public/sample_data.json
验证消息是否发布到该主题:
./bin/kafka-console-consumer.sh \
--bootstrap-server kafkaHost:9092 --topic topic_json_gpkafka \
--from-beginning
在目标Greenplum数据库gpHost创建表json_from_kafka:
#su – gpadmin
$ psql -d testdb
testdb=# CREATE TABLE testgpss.json_from_kafka( customer_id int8, month int4, amount_paid decimal(9,2) );
在etlHost主机上配置gpss.json文件:
#cd /home/app/public
#vi gpss.json
{
"ListenAddress": {
"Host": "etlHost",
"Port": 5019
},
"Gpfdist": {
"Host": "etlHost",
"Port": 8319,
"ReuseTables": false,
"BindAddress": "etlHost"
},
"Shadow": {
"Key": "a_very_secret_key"
}
}
Shadowing Greenplum数据库密码:
在ETL系统上运行gpsscli shadow命令以交互方式生成shadowed密码
#gpsscli shadow --config gpss.json
please input your password
gpadmin
20201126 18:22:35.47730,168926,info,librdkafka version: 1.4.2(10402ff)
"SHADOW:RIZD2EBCAWH62FATYXJLELZYJM"
记下密码"SHADOW:RIZD2EBCAWH62FATYXJLELZYJM",后续如果以shadow启动GPSS启动服务的时候需要,yaml文件的greenplum密码需要填写编码后的密码。例如:
当 gpss.json的Shadow配置如下:
"Shadow": {
"Key": "a_very_secret_key"
}
yaml文件对应的配置如下:
PASSWORD: "SHADOW:RIZD2EBCAWH62FATYXJLELZYJM"
⚠Shadow的方式不适用于gpkafka命令行工具。
在etlHost主机上启动GPSS服务:
gpss gpss.json --log-dir ./gpsslogs
在ETL主机上编辑YAML文件jsonload_cfg.yaml:
#vi jsonload_cfg.yaml
DATABASE: testdb
USER: gpadmin
PASSWORD: "SHADOW:RIZD2EBCAWH62FATYXJLELZYJM"
HOST: gpHost
PORT: 5432
KAFKA:
INPUT:
SOURCE:
BROKERS: kafkaHost:9092
TOPIC: topic_json_gpkafka
COLUMNS:
- NAME: jdata
TYPE: json
FORMAT: json
ERROR_LIMIT: 10
OUTPUT:
SCHEMA: testgpss
TABLE: json_from_kafka
MAPPING:
- NAME: customer_id
EXPRESSION: (jdata->>'cust_id')::int
- NAME: month
EXPRESSION: (jdata->>'month')::int
- NAME: amount_paid
EXPRESSION: (jdata->>'expenses')::decimal
COMMIT:
MINIMAL_INTERVAL: 2000
将Kafka数据加载作业提交到在端口号5019上运行的GPSS实例。在etlHost主机上打开一个新的窗口,提交一个名为kafkajson2gp的作业:
#cd /home/app/public
#source /usr/local/gpss/gpss_path.sh
#gpsscli submit --name kafkajson2gp --gpss-host etlHost --gpss-port 5019 ./jsonload_cfg.yaml
20201127 14:49:41.98658,17487,info,JobID: 7c57fce7d134a81cad1b7794b4a6f8c5,JobName: kafkajson2gp
列出所有GPSS作业:
#gpsscli list --all --gpss-host etlHost --gpss-port 5019
JobName JobID GPHost GPPort DataBase Schema Table Topic Status
kafkajson2gp 7c57fce7d134a81cad1b7794b4a6f8c5 gpHost 5432 testdb testgpss json_from_kafka topic_json_gpkafkaJOB_STOPPED
list子命令显示所有作业。刚刚提交的kafkajson2gp作业处于“已停止”状态。
开始kafkajson2gp作业:
#gpsscli start kafkajson2gp --gpss-host etlHost --gpss-port 5019
20201127 15:06:39.05403,25364,info,Job kafkajson2gp is started
如果存在以下报错:
start job failed, err: rpc error: code = Unknown desc = InitJob: Formatter check failed: gpss formatter version is too old: pq: function gpfmt_version() does not exist
这个报错是因为Greenplum数据库自带的gpss版本太低了,升级即可(greenplum每个节点都需要安装)。例如:
安装包下载地址:
https://network.pivotal.io/products/greenplum-streaming-server
GPSS tarball安装步骤:
(1)把你下载的GPSS tarball文件上传到或者下载到Greenplum集群的主机。
(2)通过Greenplum管理员账户登陆到Greenplum集群主机,并设置环境变量,例如:
$ ssh gpadmin@<gphost>
gpadmin@gphost$ . /usr/local/greenplum-db/greenplum_path.sh
(3)解压tar.gz文件,例如:
gpadmin@gphost$ tar xzvf gpss-gpdb5-1.3.5-rhel6-x86_64.tar.gz
解压后会在当前目录生成一个名为gpss-gpdb5-1.3.5-rhel6_x86_64/的目录,该目录包含 bin/、lib/ 和 share/目录,以及一个名为install gpdb_component的安装脚本。
(1)切换到解压后的目录。
gpadmin@gphost$ cd gpss-gpdb5-1.3.5-rhel6_x86_64
(2)执行安装脚本,安装新的GPSS组件到$GPHOME,例如:
gpadmin@gphost$ ./install_gpdb_component
停止kafkajson2gp作业:
#gpsscli stop kafkajson2gp --gpss-host etlHost --gpss-port 5019
移除作业:
#gpsscli remove kafkajson2gp --gpss-host etlHost --gpss-port 5019
检查gpss命令的输出和日志文件,以查找标识插入/拒绝的行数的消息:
#cat ./gpsslogs/progress_kafkajso_7c57fce7_20201127.log
timestamp,pid,level,batch_id,start_time,end_time,total_byte,speed,total_read_count,inserted_rows, rejected_rows
20201127 16:58:21.67019,22469,info,1,2020-11-27 08:58:19.487+00,2020-11-27 08:58:21.622+00,568,266B,9,9,0
在目标Greenplum数据库查看表json_from_kafka:
testdb=# select * from testgpss.json_from_kafka;
customer_id | month | amount_paid
-------------+-------+-------------
1313131 | 12 | 1313.13
1313131 | 10 | 492.83
1313131 | 11 | 368.27
1313131 | 12 | 1313.13
1313131 | 10 | 492.83
1313131 | 11 | 368.27
7979797 | 10 | 4489.00
7979797 | 11 | 18.72
7979797 | 12 | 173.18
7979797 | 10 | 4489.00
7979797 | 11 | 18.72
7979797 | 12 | 173.18
3535353 | 11 | 761.35
3535353 | 10 | 6001.94
3535353 | 12 | 81.12
3535353 | 11 | 761.35
3535353 | 10 | 6001.94
3535353 | 12 | 81.12
(18 rows)
根据官方文档整理,如有不妥,请联系本人删除。
参考文档:https://gpdb.docs.pivotal.io/streaming-server/1-3-6/gpss-v13-release-notes.html