GreenPlum入门

GreenPlum入门

六翁 2016-07-13 14345浏览量
简介: 本文旨在小白入门。 我开始研究GP的动机是因为它在MPP和数仓方面的名声,另外,GP也是集团技术栈的一员。 ### 1 GPDB简介 > Pivotal **Greenplum** Database is a MPP (massively parallel processing) database built on open source [PostgreSQL](https:/

本文旨在小白入门。

我开始研究GP的动机是因为它在MPP和数仓方面的名声,另外,GP也是集团技术栈的一员。
1 GPDB简介

    Pivotal Greenplum Database is a MPP (massively parallel processing) database built on open source PostgreSQL.The system consists of a master node, standby master node, and segment nodes.

    All of the data resides on the segment nodes and the catalog information is stored in the master nodes. Segment nodes run one or more segments, which are modified PostgreSQL database instances and are assigned a content identifier.

    For each table the data is divided among the segment nodes based on the distribution column keys specified by the user in the DDL statement.

    For each segment content identifier there is both a primary segment and mirror segment which are not running on the same physical host.

    When a SQL query enters the master node, it is parsed, optimized and dispatched to all of the segments to execute the query plan and either return the requested data or insert the result of the query into a database table.

从定义上看,GP和此前我比较专注的ES(Elasticsearch)非常类似,它们都是基于成熟的单机技术——PG(PostgreSQL)和Lucene,采用去中心化的分布式技术,实现了对大数据的(MPP和信息检索)支持。
2 准备

多台物理机搭建GPDB集群(1个主节点、1个备节点、n个Segment节点)是最佳的实战环境,但作为入门,GP官方提供的Docker镜像足够用了。使用如下命令拉取这个镜像:

docker pull pivotaldata/gpdb-base

接下来使用如下命令下载GP官方提供的教程示例,然后解压到任意目录(本例使用的路径为/Users/erichan/Downloads/greenplum-db-gpdb-sandbox-tutorials-6794737):

wget https://github.com/greenplum-db/gpdb-sandbox-tutorials/zipball/master

启动GP容器,并挂载上述教程示例:

docker rm $(docker ps -aq)

docker run -ti \
-p 5432:5432 \
--name gp_tutorials \
--hostname gp-tutorials \
-v /Users/erichan/Downloads/greenplum-db-gpdb-sandbox-tutorials-6794737:/tutorials \
pivotaldata/gpdb-base

如下的练习都是在上述这个容器中进行的。

容器内包含的用户/密码信息如下:

    root/pivotal
    gpadmin/pivotal

如下练习均使用gpadmin用户,记得切换:

su - gpadmin

3 创建用户和角色
3.1 Create a user with the createuser utility command

createuser -P user1

Enter password for new role:
Enter it again:
Shall the new role be a superuser? (y/n) n
Shall the new role be allowed to create databases? (y/n) y
Shall the new role be allowed to create more new roles? (y/n) n
NOTICE:  resource queue required -- using default resource queue
"pg_default"

3.2 Create a user with the CREATE USER command

psql template1

template1=# CREATE USER user2 WITH PASSWORD 'pivotal' NOSUPERUSER;

template1=# \du

                       List of roles
 Role name |            Attributes             | Member of
-----------+-----------------------------------+-----------
 gpadmin   | Superuser, Create role, Create DB |
 user1     | Create DB                         |
 user2     |                                   |

3.3 Create a users group and add the users to it

template1=# CREATE ROLE users;
template1=# GRANT users TO user1, user2;

template1=# \du
                       List of roles
 Role name |            Attributes             | Member of
-----------+-----------------------------------+-----------
 gpadmin   | Superuser, Create role, Create DB |
 user1     | Create DB                         | {users}
 user2     |                                   | {users}
 users     | Cannot login                      |

3.4 Exit out of the psql shell

template1=# \q

4 创建和准备数据库
4.1 Create Database

dropdb tutorial
createdb tutorial

psql -l
                  List of databases
   Name    |  Owner  | Encoding |  Access privileges
-----------+---------+----------+---------------------
 gpadmin   | gpadmin | UTF8     |
 postgres  | gpadmin | UTF8     |
 template0 | gpadmin | UTF8     | =c/gpadmin
                                : gpadmin=CTc/gpadmin
 template1 | gpadmin | UTF8     | =c/gpadmin
                                : gpadmin=CTc/gpadmin
 tutorial  | gpadmin | UTF8     |
(5 rows)

4.2 Grant database privileges to users

psql -U gpadmin tutorial

tutorial=# GRANT ALL PRIVILEGES ON DATABASE tutorial TO user1, user2;

tutorial=# \q

psql -l
                  List of databases
   Name    |  Owner  | Encoding |  Access privileges
-----------+---------+----------+---------------------
 gpadmin   | gpadmin | UTF8     |
 postgres  | gpadmin | UTF8     |
 template0 | gpadmin | UTF8     | =c/gpadmin
                                : gpadmin=CTc/gpadmin
 template1 | gpadmin | UTF8     | =c/gpadmin
                                : gpadmin=CTc/gpadmin
 tutorial  | gpadmin | UTF8     | =Tc/gpadmin
                                : gpadmin=CTc/gpadmin
                                : user1=CTc/gpadmin
                                : user2=CTc/gpadmin
(5 rows)

4.3 Create a schema and set a search path

psql -U user1 -h gp-tutorials tutorial

tutorial=> DROP SCHEMA IF EXISTS faa CASCADE;

tutorial=> CREATE SCHEMA faa;

tutorial=> SET SEARCH_PATH TO faa, public, pg_catalog, gp_toolkit;

tutorial=> SHOW search_path;
             search_path
-------------------------------------
 faa, public, pg_catalog, gp_toolkit
(1 row)

tutorial=> \q

5 创建表

cd /tutorials/faa
psql -U gpadmin tutorial
tutorial=#  \i create_dim_tables.sql
tutorial=# \dt

5.1 Data Loading

tutorial=# \d faa.d_cancellation_codes
Table "faa.d_cancellation_codes"
   Column    | Type | Modifiers
-------------+------+-----------
 cancel_code | text |
 cancel_desc | text |
Distributed by: (cancel_code)

5.2 Load data with the INSERT statement

tutorial=# INSERT INTO faa.d_cancellation_codes
tutorial-# VALUES ('A', 'Carrier'),
tutorial-# ('B', 'Weather'),
tutorial-# ('C', 'NAS'),
tutorial-# ('D', 'Security'),
tutorial-#  ('', 'none');
INSERT 0 5

5.3 Load data with the COPY statement

tutorial-# \i copy_into_airports.sql
tutorial-# \i copy_into_airlines.sql
tutorial-# \i copy_into_delay_groups.sql
tutorial-# \i copy_into_distance_groups.sql
tutorial-# \i copy_into_wac.sql
tutorial-# \q

5.4 Load data with gpdist

Execute gpfdist

gpfdist -d /tutorials/faa -p 8081 > /tmp/gpfdist.log 2>&1 &

ps -A | grep gpfdist

 1202 ?        00:00:00 gpfdist

more /tmp/gpfdist.log

...
Serving HTTP on port 8081, directory /tutorials/faa

psql -U gpadmin tutorial

tutorial=#  \i create_load_tables.sql
tutorial=#  \i create_ext_table.sql
tutorial=# INSERT INTO faa.faa_otp_load SELECT * FROM faa.ext_load_otp;

tutorial=#  \x

tutorial=# SELECT DISTINCT relname, errmsg, count(*) FROM faa.faa_load_errors GROUP BY 1,2;
-[ RECORD 1 ]-------------------------------------------------
relname | ext_load_otp
errmsg  | invalid input syntax for integer: "", column deptime
count   | 26526

tutorial=# \q

killall gpfdist

5.5 Load data with gpload

vi gpload.yaml

---
VERSION: 1.0.0.1
# describe the Greenplum database parameters
DATABASE: tutorial
USER: gpadmin
HOST: gp-tutorials
PORT: 5432
# describe the location of the source files
# in this example, the database master lives on the same host as the source files
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - gp-tutorials
         PORT: 8081
         FILE:
           - /tutorials/faa/otp*.gz
    - FORMAT: csv
    - QUOTE: '"'
    - ERROR_LIMIT: 50000
    - ERROR_TABLE: faa.faa_load_errors
   OUTPUT:
    - TABLE: faa.faa_otp_load
    - MODE: INSERT
   PRELOAD:
    - TRUNCATE: true

gpload -f gpload.yaml -l gpload.log

...
started gpfdist -p 8081 -P 8082 -f "/tutorials/faa/otp*.gz" -t 30
...
running time: 22.41 seconds
...
rows Inserted          = 1024552

5.6 Create and Load fact tables

psql -U gpadmin tutorial
tutorial=#  \i create_fact_tables.sql
tutorial=# \i load_into_fact_table.sql

6 查询和调优
6.1 Analyze the tables

tutorial=# ANALYZE faa.d_airports;
ANALYZE
tutorial=# ANALYZE faa.d_airlines;
ANALYZE
tutorial=# ANALYZE faa.d_wac;
ANALYZE
tutorial=# ANALYZE faa.d_cancellation_codes;
ANALYZE
tutorial=# ANALYZE faa.faa_otp_load;
ANALYZE faa.otp_r;
ANALYZE
tutorial=# ANALYZE faa.otp_r;
ANALYZE
tutorial=# ANALYZE faa.otp_c;
ANALYZE

6.2 View explain plans

tutorial=# \timing on
tutorial=# \i create_sample_table.sql
tutorial=# EXPLAIN SELECT COUNT(*) FROM sample WHERE id > 100;
tutorial=# EXPLAIN ANALYZE SELECT COUNT(*) FROM sample WHERE id > 100;

6.3 Changing optimizers

tutorial=# \q

gpconfig -s optimizer

Values on all segments are consistent
GUC          : optimizer
Master  value: off
Segment value: off

gpconfig -c optimizer -v on --masteronly

gpstop -u

6.4 Indexes and performance

psql -U gpadmin tutorial

tutorial=# \i create_sample_table.sql

tutorial=# SELECT * from sample WHERE big = 12345;
   id    |  big  | wee | stuff
---------+-------+-----+-------
   12345 | 12345 |   0 |
 1012346 | 12345 |   0 |
 2012347 | 12345 |   0 |
(3 rows)

tutorial=# EXPLAIN SELECT * from sample WHERE big = 12345;

tutorial=# CREATE INDEX sample_big_index ON sample(big);

tutorial=# EXPLAIN SELECT * FROM sample WHERE big = 12345;

6.5 Row vs. Column orientation

tutorial=# CREATE TABLE FAA.OTP_C (LIKE faa.otp_r) WITH (appendonly=true,orientation=column)DISTRIBUTED BY (UniqueCarrier, FlightNum) PARTITION BY RANGE(FlightDate)(PARTITION mth START('2009-06-01'::date) END ('2010-10-31'::date) EVERY ('1 mon'::interval));

tutorial=# INSERT INTO faa.otp_c SELECT * FROM faa.otp_r;

tutorial=# \d faa.otp_r

tutorial=# \d faa.otp_c

tutorial=# SELECT pg_size_pretty(pg_relation_size('faa.otp_r'));
 pg_size_pretty
----------------
 256 MB
(1 row)

tutorial=# SELECT pg_size_pretty(pg_total_relation_size('faa.otp_r'));
 pg_size_pretty
----------------
 256 MB
(1 row)

tutorial=# SELECT pg_size_pretty(pg_relation_size('faa.otp_c'));
 pg_size_pretty
----------------
 0 bytes
(1 row)

tutorial=# SELECT pg_size_pretty(pg_total_relation_size('faa.otp_c'));
 pg_size_pretty
----------------
 288 kB
(1 row)

tutorial=#
tutorial=# SELECT pg_size_pretty(pg_total_relation_size('faa.otp_c'));
 pg_size_pretty
----------------
 288 kB
(1 row)

6.6 Check for even data distribution on segments

tutorial=# SELECT gp_segment_id, COUNT(*) FROM faa.otp_c GROUP BY gp_segment_id ORDER BY gp_segment_id;

6.7 About partitioning

tutorial=# \timing on

tutorial=#  SELECT MAX(depdelay) FROM faa.otp_c WHERE UniqueCarrier = 'UA';

tutorial=# SELECT MAX(depdelay) FROM faa.otp_c WHERE flightdate ='2009-11-01';

7 集成分析工具

    http://madlib.incubator.apache.org
    https://zeppelin.incubator.apache.org

8 备份和恢复操作

    To run a full backup:gpcrondump -x tutorial -u /tmp -a -r
    To view the backups:ls -al /tmp/db_dumps
    To restore the data:gpdbrestore -T faa.otp_r -s tutorial -u /tmp -a

9 Reference

    http://greenplum.org/gpdb-sandbox-tutorials
    https://github.com/pivotaldata/gpdb-docker

 

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值