Installing the Greenplum Database System

1. Introduction to Greenplum

Greenplum Database meets the demands of storing and processing large volumes of data by distributing the work across multiple servers, or hosts.

 

A Greenplum logical database is actually an array of individual PostgreSQL databases, all working together to present a single database image.

 

The master is the entry point to the Greenplum Database system. Clients connect to the database and submit SQL statements through the master. The master coordinates the work of the other database instances in the system. The segments handle the data processing and storage. Communication between the segments, and between the segments and the master, takes place over the interconnect, the networking layer of Greenplum Database.

 

The master stores the system catalog (the data dictionary), but no user data; all user data resides on the segments. The master authenticates client connections, processes the incoming SQL statements, distributes the workload among the segments, coordinates the results returned by each segment, and presents the final result to the client.

 

Because the master does not store user data, its load is light. In production environments, however, extra storage space is usually still needed on the master for load files and backup files. Customers may also want to run ETL and reporting tools on the master, which requires additional disk space and CPU capacity. The master does need a fast, dedicated CPU for data loading, client connection handling, and query plan generation.

 

You can optionally configure a backup, or mirror, of the master. A backup master host runs in warm standby in case the primary master host becomes unavailable. The standby master can be deployed on a designated redundant master host or on one of the segment hosts.

 

The standby master host is kept up to date by a transaction log replication process, which runs on the standby master host and keeps the data of the primary and standby masters synchronized. If the primary master fails, the log replication process is shut down and the standby master is activated in its place. Upon activation of the standby, the replicated logs are used to reconstruct the state of the primary master host as of its last successfully committed transaction.

 

Since the master does not store user data, only the system catalog tables need to be synchronized between the primary and the standby. These tables are not updated frequently, but when they are, the changes are automatically copied over to the standby master so that it always stays in sync with the primary.

 

The segments are where user data is stored and where most query processing takes place. User data and indexes are distributed across all of the available segments in the Greenplum Database system, each segment holding a distinct portion of the data. Segment instances are the database server processes that serve segments. Users do not connect to the segments directly; they connect to the Greenplum Database system through the master.

 

In the hardware configuration recommended by Greenplum, each active segment gets a dedicated CPU or CPU core. For example, if a host has two dual-core processors, you can deploy four primary segments per host.

 

When deploying a Greenplum Database system, you can optionally configure mirror segments. Mirror segments allow database queries to fail over to a backup segment if the primary segment becomes unavailable. To configure mirroring, the system must have enough hosts so that the secondary (mirror) segment always resides on a different host than its primary segment.

 

With mirroring configured, the system automatically fails over to the mirror segment when a primary segment goes down. A Greenplum Database system can remain operational when a segment instance or host fails, as long as every portion of the data is still available on the remaining active segments.

 

Whenever the master cannot connect to a segment instance, it marks that segment instance as invalid in the system catalog. The segment instance remains invalid and out of operation until steps are taken to bring it back online. Once a segment is back online, the master marks it as valid again the next time it successfully connects to that segment.

 

How a failed segment is recovered depends on the configured fault operational mode. If the system is running in read-only mode (the default), users cannot issue DDL or DML statements while the system has failed segments. In read-only mode, recovering a failed segment requires only a brief interruption of service. If the system is running in continue mode, all operations continue as long as one active segment instance exists for each portion of the data. In this mode, the data on a failed segment must first be brought back in sync with its active counterpart before the failed segment can return to online operation, and the system must be shut down to recover failed segments.

 

If the system is not configured with mirroring, it shuts down automatically whenever a segment instance becomes invalid, and all failed segments must be recovered before database operations can resume.

 

Example Segment Host Hardware Stack

Regardless of the hardware platform chosen, a production Greenplum Database processing node (a segment host) is typically configured as described below.

Segment hosts do the majority of the data processing, so the segment host servers are configured for the best possible performance. Greenplum Database will only be as fast as the slowest segment server in the array. It is therefore important to ensure that the underlying hardware and operating systems running Greenplum Database are performing at their optimal level. It is also recommended that all segment hosts in a Greenplum Database array have identical hardware resources and configurations.

 

Segment hosts should be dedicated to Greenplum Database only. To get the best query performance, do not let Greenplum Database compete with other applications for machine or network resources.

 

Consider an example hardware stack for a Greenplum Database segment host. The number of effective CPUs on a host is the basis for determining how many primary segment instances to deploy on it. In this example, the host has two effective CPUs (one dual-core CPU), with one primary segment instance (or one primary/mirror pair, if using mirroring) per CPU core.

 

Typically, each CPU maps to one logical disk. A logical disk consists of one primary file system (and optionally a mirror file system) accessing a pool of physical disks through an I/O channel or disk controller. The logical disk and file system are provided by the operating system. Most operating systems provide the ability for a logical disk drive to use groups of physical disks arranged in RAID arrays. Depending on the hardware platform you choose, different RAID configurations offer different performance and capacity levels.

The interconnect is the networking layer of Greenplum Database. When a user connects to a database and issues a query, processes are created on the segments to handle that query. The interconnect refers to the inter-process communication between the segments, as well as the network infrastructure on which this communication relies. The interconnect uses a standard Gigabit Ethernet switching fabric.

 

By default, the interconnect uses UDP to send messages over the network. The Greenplum software performs the additional packet verification that UDP itself does not, so the reliability is equivalent to TCP while the performance and scalability exceed those of TCP.

 

A highly available interconnect can be achieved by deploying dual Gigabit Ethernet switches on your network.

 

A typical configuration uses one network interface card (NIC) per primary segment instance on a segment host; with mirroring, a primary/mirror pair shares one NIC. The master host needs four NICs facing the Greenplum Database array, plus two additional NICs for external connectivity (six NICs in total).

 

On a segment host, a dedicated host name is created for each NIC. For example, a host with four NICs needs four corresponding host names, each of which maps to one primary segment instance. The master host is configured the same way; however, only one master host name is used for the Greenplum array when initializing the system.

 

With this configuration, the operating system automatically selects the best path to the destination, and Greenplum Database automatically balances the network destinations to maximize parallelism.

 

When multiple Gigabit Ethernet switches are used within the array, divide the subnets evenly between the switches. In this example configuration, with two switches, NICs 1 and 2 on each host would use switch 1, and NICs 3 and 4 would use switch 2. For the master host, the host name bound to NIC 1 (and therefore using switch 1) is the effective master host name for the array. Therefore, if deploying a warm standby master for redundancy, the standby master's host name should map to a NIC that uses a different switch than the primary master.
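As an illustration only (the addresses below are hypothetical; the host names follow the naming used later in this guide), the /etc/hosts entries for the first segment host might place each NIC on its own subnet:

192.168.1.101   sdw1-1
192.168.2.101   sdw1-2
192.168.3.101   sdw1-3
192.168.4.101   sdw1-4

With NICs 1 and 2 cabled to switch 1 and NICs 3 and 4 to switch 2, this matches the two-switch layout described above.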

 

ETL Hosts for Data Loading

――――――

 

Greenplum also provides an optional performance monitoring feature that administrators can install and enable to manage the database. To use Greenplum Performance Monitor, an agent must be installed on every host in the Greenplum Database array. When Greenplum Performance Monitor is enabled, the agents begin collecting data on queries and system utilization. The segment agents send their data to the master every 15 seconds. Users can query the Greenplum Performance Monitor database to see system performance data for active and historical queries. Greenplum also has a graphical web user interface for viewing these performance metrics.

 

2. Estimating Storage Capacity

To estimate how much data your Greenplum Database system can accommodate, use the following measurements as guidelines. Also keep in mind that you may want additional space on each segment host for backup files and data load files.

 

To calculate how much data a Greenplum Database system can hold, calculate the usable disk capacity per segment host and then multiply by the number of segment hosts in your array. Start with the raw storage capacity of the physical disks on a segment host that are available for data storage (raw_capacity), which is:

 

disk_size * number_of_disks

 

Account for file system formatting overhead (roughly 10 percent) and the RAID level you are using. For example, with RAID-10 the calculation would be:

 

(raw_capacity * 0.9) / 2 = formatted_disk_space

 

For optimal performance, Greenplum recommends not filling your disks to capacity, but running at 70% or less of total capacity. With this in mind, calculate the usable disk space as follows:

 

formatted_disk_space * 0.7 = usable_disk_space

 

Once you have formatted your RAID disk arrays and accounted for the recommended maximum capacity (usable_disk_space), you need to calculate how much of that space is actually available for user data (U). If using Greenplum mirrors for data redundancy, the size of your user data doubles (2 * U). Greenplum Database also requires some space to be reserved as a working area for active queries, roughly one third the size of the user data (work space = U/3):

With mirrors: (2 * U) + U/3 = usable_disk_space

Without mirrors: U + U/3 = usable_disk_space
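As a worked example with hypothetical numbers: a segment host with eight 600 GB disks has raw_capacity = 8 * 600 GB = 4800 GB. With RAID-10, formatted_disk_space = (4800 * 0.9) / 2 = 2160 GB, and usable_disk_space = 2160 * 0.7 = 1512 GB. Solving the formulas above for U, a mirrored system could then hold roughly U = 1512 * 3/7 ≈ 648 GB of user data per host, or U = 1512 * 3/4 = 1134 GB without mirrors.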

 

Calculating User Data Size

As with all databases, raw data grows somewhat once loaded into the database. On average, raw data is about 1.4 times larger on disk after it is loaded, but it could be smaller or larger depending on the data types used, the table storage type, in-database compression, and so on.

 

Page Overhead: When your data is loaded into Greenplum Database, it is divided into pages of 32 KB each. Each page has 10 bytes of page overhead.

 

Row Overhead: In a regular heap table, each row of data has 24 bytes of row overhead. An append-only table has only 4 bytes of row overhead per row.

 

Attribute Overhead: For the data values themselves, the size associated with each attribute value depends on the data type chosen. As a general rule, you should use the smallest data type possible to store your data (assuming you know the possible values each column will have).

 

Indexes: In Greenplum Database, indexes are distributed across the segment hosts just like table data. The default index type in Greenplum Database is the B-tree. Because index size depends on the number of unique values in the index and the data inserted into the index, precalculating the exact size of an index is impossible. However, you can roughly estimate the size of an index using these formulas:

B-tree: unique_values * (data_type_size + 24 bytes)

Bitmap: (unique_values * number_of_rows * 1 bit * compression_ratio / 8) + (unique_values * 32)
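For example, with hypothetical figures, a B-tree index on a 4-byte integer column containing 1,000,000 unique values would come to roughly 1,000,000 * (4 + 24) bytes ≈ 28 MB.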

 

Calculating Space Requirements for Metadata and Logs

 

For each segment host, you also need to account for the space required by Greenplum Database log files and metadata:

System metadata: For each segment instance (primary or mirror) or master instance, allow about 20 MB for the system catalogs and metadata.

 

Write ahead log: For each segment instance (primary or mirror) or master instance, allocate space for the write ahead log (WAL). The WAL is divided into segment files of 64 MB each. At most, the number of WAL files will be:

2 * checkpoint_segments + 1

Use this formula to estimate the space the WAL may require. The default checkpoint_segments value for a Greenplum Database instance is 8, meaning (2 * 8 + 1) * 64 MB = 1088 MB of WAL space to allocate for each segment or master instance on a host.

 

Greenplum Database log files: Each segment instance and the master instance generates database log files, which grow over time. Sufficient space should be allocated for these log files, and some type of log rotation facility should be used to ensure that the log files do not grow too large.

Performance monitor data: The Greenplum Performance Monitor agents run on the same hosts as your Greenplum Database instances and use the resources of those hosts. The resource consumption of the agent processes on these hosts is minimal and should not significantly affect database performance. Historical data collected by the agents is stored in a dedicated gpperfmon database within your Greenplum Database system. Collected monitor data is distributed just like regular database data, so you will need to account for disk space in the data directory locations of your Greenplum segment instances. The amount of space required depends on the amount of historical data you want to keep.

 

3. Configuring your Systems for Greenplum

Configuring OS Parameters for Greenplum

In general, the following categories of system parameters need to be altered:

Shared Memory - A Greenplum Database instance will not work unless the shared memory segment for your kernel is properly sized. Most default OS installations have the shared memory values set too low for Greenplum Database. On Linux systems, you must also disable the OOM (out of memory) killer.

Network - On high-volume Greenplum Database systems, certain network-related tuning parameters must be set to optimize network connections made by the Greenplum interconnect.

User Limits - User limits control the resources available to processes started by a user's shell. Greenplum Database requires a higher limit on the allowed number of file descriptors that a single process can have open. The default settings may cause some Greenplum Database queries to fail because they will run out of file descriptors needed to process the query.

 

Linux

Set the following parameters in the /etc/sysctl.conf file and reboot:

kernel.shmmax = 500000000

kernel.shmmni = 4096

kernel.shmall = 4000000000

kernel.sem = 250 64000 100 512

net.ipv4.tcp_tw_recycle=1

net.ipv4.tcp_max_syn_backlog=4096

net.core.netdev_max_backlog=10000

net.ipv4.conf.default.arp_filter=1

net.ipv4.conf.all.arp_filter=1

vm.overcommit_memory=2
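If you prefer not to wait for the reboot described above, on most Linux distributions the settings in /etc/sysctl.conf can also be loaded immediately (a hedged alternative; rebooting remains the documented path):

# sysctl -p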

Set the following parameters in the /etc/security/limits.conf file:

* soft nofile 65536

* hard nofile 65536

* soft nproc 131072

* hard nproc 131072
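These limits take effect at the next login. As a quick sanity check after logging in again (for example as the gpadmin user), the open file descriptor limit should report the new value:

$ ulimit -n
65536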

 

Running the Greenplum Installer

To configure your systems for Greenplum Database, you will need certain utilities found in $GPHOME/bin of your installation. Log in as root and run the Greenplum installer on the machine that will be your master host.

 

To install the Greenplum binaries on the master host

1.Download or copy the installer file to the machine that will be the Greenplum Database master host. Installer files are available from Greenplum for RedHat (32-bit and 64-bit), Solaris 64-bit and SuSE Linux 64-bit platforms.

2.Unzip the installer file, where PLATFORM is either RHEL4-i386 (RedHat 32-bit), RHEL4-x86_64 (RedHat 64-bit), SOL-x86_64 (Solaris 64-bit) or SuSE10-x86_64 (SuSE Linux 64-bit). For example:

# unzip greenplum-db-4.0.0.0-PLATFORM.zip

3.Launch the installer using bash. For example:

# /bin/bash greenplum-db-4.0.0.0-PLATFORM.bin

4.The installer will prompt you to accept the Greenplum Database license agreement. Type yes to accept the license agreement.

5.The installer will prompt you to provide an installation path. Press ENTER to accept the default install path (/usr/local/greenplum-db-4.0.0.0), or enter an absolute path to an install location. You must have write permissions to the location you specify.

6.The installer will install the Greenplum software and create a greenplum-db symbolic link one directory level above your version-specific Greenplum installation directory. The symbolic link is used to facilitate patch maintenance and upgrades between versions. The installed location is referred to as $GPHOME.

About Your Greenplum Database Installation

greenplum_path.sh — This file contains the environment variables for Greenplum Database. See “Configuring Greenplum Environment Variables” on page 22.

bin — This directory contains the Greenplum Database management utilities. This directory also contains the PostgreSQL client and server programs, most of which are also used in Greenplum Database.

demo — This directory contains the Greenplum demonstration programs.

docs — The Greenplum Database documentation (PDF files).

ext — Bundled programs (such as Python) used by some Greenplum Database utilities.

include — The C header files for Greenplum Database.

lib — Shared library files for Greenplum Database.

share — Shared files for Greenplum Database.

 

Creating the Greenplum Administrative User

You cannot run the Greenplum Database server as root. Greenplum recommends that you designate a user account that will own your Greenplum Database installation, and to always start and administer Greenplum Database as this user. For the purposes of this documentation, we will use the user name of gpadmin. You can choose any user name you like, but be sure to use the same user name consistently on all hosts in your Greenplum Database system.

To add a new user, for example, run the following commands as root:

# useradd gpadmin

# passwd gpadmin

# New password: password

# Retype new password: password

Creating the gpadmin User on Multiple Hosts at Once

Greenplum provides a utility called gpssh that allows you to open an ssh session and run commands on multiple hosts at once. This functionality allows you to perform administrative tasks on all of your segment hosts at the same time. gpssh requires a trusted-host environment (logging on to remote systems without a password prompt). In order to use gpssh to perform system administration tasks such as creating users, you must first exchange ssh keys as root. Greenplum provides a utility called gpssh-exkeys, which sets up a trusted host environment as the currently logged in user.

To exchange ssh keys as root

1.Log in as root on the master host, and source the greenplum_path.sh file from your Greenplum installation.

# source /usr/local/greenplum-db/greenplum_path.sh

2.Create a host list file that has one host name per line and includes a host name for each host in your Greenplum system (master, standby master and segments). Make sure there are no blank lines or extra spaces. If a host has multiple configured host names, use only one host name per host (this file is only used here to create the user, so a single name per target host is enough; extra names would just repeat the command on the same host). For example:

mdw

sdw1-1

sdw2-1

sdw3-1

3.Run the gpssh-exkeys utility referencing the host list file (all_hosts_file) you just created. For example:

# gpssh-exkeys -f all_hosts_file

4.gpssh-exkeys will check the remote hosts and perform the key exchange between all hosts. Enter the root user password when prompted. For example:

***Enter password for root@hostname: root password

To create the gpadmin user

5.Run gpssh to create the gpadmin user on all hosts (if it does not exist already). Use the all_hosts_file you created in step 2. For example:

# gpssh -f all_hosts_file '/usr/sbin/useradd gpadmin -d /home/gpadmin -s /bin/bash'

6.Set the new gpadmin user’s password. On Linux, you can do this on all segment hosts at once using gpssh. For example:

# gpssh -f all_hosts_file 'echo password | passwd gpadmin --stdin'

Replace password above with the actual password for the gpadmin user.

On Solaris, you must log in to each segment host and set the gpadmin user’s password on each host. For example:

# ssh segment_hostname

# passwd gpadmin

#

7.After the gpadmin user is created, change the ownership of your Greenplum master installation directory to this user. For example:

# chown -R gpadmin /usr/local/greenplum-db

 

Configuring Greenplum Environment Variables

You must configure your environment on the Greenplum Database master (and standby master). A greenplum_path.sh file is provided in your $GPHOME directory with environment variable settings for Greenplum Database. You can source this in the gpadmin user’s startup shell profile (such as .bashrc).

For example, you could add a line similar to the following to your chosen profile files:

source /usr/local/greenplum-db/greenplum_path.sh

After editing the chosen profile file, source it as the correct user to make the changes active. For example:

$ source ~/.bashrc

Note: The .bashrc file should not produce any output. If you wish to have a message display to users upon logging in, use the .profile file.
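To confirm that the environment is active, you can check that $GPHOME is set and that the management utilities resolve on the PATH (this assumes greenplum_path.sh adds $GPHOME/bin to the PATH, as the installation layout above suggests):

$ echo $GPHOME
$ which gpssh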

 

Installing the Greenplum Binaries on Multiple Hosts

Greenplum provides a utility called gpscp that allows you to copy files to multiple hosts at once. Using gpscp and gpssh, you can install the Greenplum Database software on all of your segment hosts at once.

To install the Greenplum software on the segment hosts

1.On the master host, create a tar file of your Greenplum Database installation. For example (running as root):

# su -

# cd /usr/local

# gtar -cvf /home/gpadmin/gp.tar greenplum-db-4.0.0.0

2.Create a host list file that has one host name per line and includes a host name for each segment host in your Greenplum system. Make sure there are no blank lines or extra spaces. If a host has multiple configured host names, use only one host name per host (one name per machine is enough; multiple host names resolve to the same machine, so repeated copies would merely overwrite the file). For example:

sdw1-1

sdw2-1

sdw3-1

3.Copy the tar file to the segment hosts using gpscp, where seg_hosts_file is the host list file you just created. For example:

# gpscp -f seg_hosts_file /home/gpadmin/gp.tar =:/usr/local

4.Start an interactive session in gpssh using the same seg_hosts_file host list file. For example:

# gpssh -f seg_hosts_file

5.At the gpssh command prompt, untar the tar file in the installation directory on the segment hosts. For example:

=> gtar --directory /usr/local -xvf /usr/local/gp.tar

6.Confirm that the Greenplum Database directory was installed in the correct location (the same location as $GPHOME on your master host). For example:

=> ls /usr/local/greenplum-db-4.0.0.0

7.Create a greenplum-db symbolic link to point to the current version directory of your Greenplum Database software. For example:

=> ln -s /usr/local/greenplum-db-4.0.0.0 /usr/local/greenplum-db

8.Change the ownership of the Greenplum Database install directory to the gpadmin user. For example:

=> chown -R gpadmin /usr/local/greenplum-db

9.Remove the tar file. For example:

=> rm /usr/local/gp.tar

(When tried in the gpssh session, this command appeared to hang and would not complete.)

10.Exit gpssh interactive mode:

=> exit

 

 

Creating the Data Storage Areas

Every Greenplum Database master and segment instance has a designated storage area on disk that is called the data directory location. This is the file system location where the database data will be stored. Each master and segment instance needs its own designated data directory storage location.

To create the data directory location on the master

The data directory location on the master is different than those on the segments. The master does not store any user data, only the system catalog tables and system metadata are stored on the master instance, therefore you do not need to designate as much storage space as on the segments.

1.Create or choose a directory that will serve as your master data storage area. This directory should have sufficient disk space for your data and be owned by the gpadmin user and group. For example, run the following commands as root:

# mkdir /gpmaster

2.Change ownership of this directory to the gpadmin user and group. For example:

# chown gpadmin /gpmaster

# chgrp gpadmin /gpmaster

To create the data directory location on a segment

On each segment host, create or choose the directories that each segment will use to store data. For example, if a segment host has two segments, run the following commands as root:

# mkdir /gpdata1

# mkdir /gpdata2

Change ownership of these directories to the gpadmin user and group. For example:

# chown gpadmin /gpdata1

# chgrp gpadmin /gpdata1

# chown gpadmin /gpdata2

# chgrp gpadmin /gpdata2

Setting up a Trusted Host Environment

The Greenplum Database management utilities require a trusted-host environment (logging on to remote systems without a password prompt). To perform Greenplum administration tasks using these utilities, you must first exchange ssh keys as the Greenplum administrative user (gpadmin). Greenplum provides a utility called gpssh-exkeys, which sets up a trusted host environment as the currently logged in user. When exchanging keys as gpadmin, you must use all configured host names for the hosts in your Greenplum Database system (master, standby master and segments).

To exchange SSH keys as the gpadmin user

1.Log in as gpadmin:

$ su - gpadmin

2.Create a host list file that has one host name per line and includes all host names for all hosts in your Greenplum system. Make sure there are no blank lines or extra spaces. If a host has multiple configured host names, use all of the configured host names. For example:

mdw

mdw-1

mdw-2

mdw-3

mdw-4

sdw1-1

sdw1-2

sdw1-3

sdw1-4

sdw2-1

sdw2-2

sdw2-3

sdw2-4

sdw3-1

sdw3-2

sdw3-3

sdw3-4

3.Run the gpssh-exkeys utility referencing the host list file (all_host_interfaces_file) you just created:

$ gpssh-exkeys -f all_host_interfaces_file

4.gpssh-exkeys will check the remote hosts and perform the key exchange between all hosts. Enter the gpadmin user password when prompted. For example:

***Enter password for gpadmin@hostname: gpadmin password
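To spot-check that the trusted-host environment works, an ssh connection to any host name in the file should now succeed without prompting for a password. For example:

$ ssh sdw1-1 hostname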

 

Synchronizing System Clocks

Greenplum recommends that you synchronize the system clocks on all hosts in the array using NTP (Network Time Protocol) or a similar utility. See www.ntp.org for more information about NTP. To see if the system clocks are synchronized, run the date command using gpssh. For example:

$ gpssh -f seg_hosts_file -v date

If you have the NTP daemon installed on your segment hosts, you can use it to synchronize the system clocks. For example:

$ gpssh -f seg_hosts_file -v ntpd
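If the NTP daemon is already running on the hosts and you want to inspect its synchronization status instead, one option (assuming the standard ntpq client is installed) is:

$ gpssh -f seg_hosts_file -v 'ntpq -p'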

Next Steps

After you have configured the operating system environment and installed the Greenplum Database software on all of the hosts in the system, the next steps are:

“Validating Your Systems” on page 27

“Initializing a Greenplum Database System” on page 37

 

 

 

4. Validating Your Systems

Greenplum provides the following utilities to validate the configuration and performance of your systems:

gpcheckos

gpcheckperf

These utilities can be found in $GPHOME/bin of your Greenplum installation.

The following tests should be run prior to initializing your Greenplum Database system.

Validating OS Settings

Validating Hardware Performance

Validating OS Settings

Greenplum provides a utility called gpcheckos that can be used to verify that all hosts in your array have the correct OS settings for running the Greenplum Database software. To run gpcheckos:

1.Log in on the master host as the user who will be running your Greenplum Database system (for example, gpadmin).

2.Create a host file that has the host names of all segment hosts to include in the verification test (one host name per line). Make sure there are no blank lines or extra spaces. If using a multi-NIC configuration, this file would just have a single host name per segment host. For example:

sdw1-1

sdw2-1

sdw3-1

3.Run the gpcheckos utility using the host file you just created. For example:

$ gpcheckos -f segment_host_file

4.Look for lines prefixed with [FIX username@hostname]. These lines will explain OS-level fixes that need to be made before you initialize Greenplum Database.

5.Run the gpcheckos utility again, this time only checking the master host. For example:

$ gpcheckos -h mdw-1

Validating Hardware Performance

Greenplum provides a management utility called gpcheckperf, which can be used to identify hardware and system-level issues on the machines in your Greenplum Database array. gpcheckperf starts a session on the specified hosts and runs the following performance tests:

Network Performance (netperf)

Disk I/O Performance (dd test)

Memory Bandwidth (stream test)

Before using gpcheckperf, you must have a trusted host setup between the hosts involved in the performance test. You can use the utility gpssh-exkeys to update the known host files and exchange public keys between hosts if you have not done so already. Note that gpcheckperf calls gpssh and gpscp, so these Greenplum utilities must be in your $PATH.

Validating Network Performance

To test network performance, run gpcheckperf with one of the network test run options: parallel pair test (-r N), serial pair test (-r n), or full matrix test (-r M). The test uses the netperf TCP_STREAM test, which transfers a 5 second stream of data between the network interfaces included in the test. By default, interfaces are tested serially in pairs and per-interface results are reported. There are also options to run parallel or full matrix tests, which report a summary network performance result in MB per second (MB/s). Your average network transfer rate should not be less than 100 MB/s.

Most systems in a Greenplum Database array are configured with multiple network interface cards (NICs), each NIC on its own subnet. When testing network performance, it is important to test each subnet individually. For example, consider the following standard network configuration of four NICs per host:

Table 4.1 Example Network Interface Configuration

Greenplum Host   Subnet1 NICs   Subnet2 NICs   Subnet3 NICs   Subnet4 NICs
Master           mdw-1          mdw-2          mdw-3          mdw-4
Segment 1        sdw1-1         sdw1-2         sdw1-3         sdw1-4
Segment 2        sdw2-1         sdw2-2         sdw2-3         sdw2-4
Segment 3        sdw3-1         sdw3-2         sdw3-3         sdw3-4

 

You would create four distinct host files for use with the gpcheckperf network test:

 

Table 4.2 Example Network Test Host File Contents

host_file_nic1   host_file_nic2   host_file_nic3   host_file_nic4
mdw-1            mdw-2            mdw-3            mdw-4
sdw1-1           sdw1-2           sdw1-3           sdw1-4
sdw2-1           sdw2-2           sdw2-3           sdw2-4
sdw3-1           sdw3-2           sdw3-3           sdw3-4

 

You would then run gpcheckperf once per subnet. For example, if testing an even number of hosts, run in parallel pairs test mode (after running the commands below, check each output file for error messages):

$ gpcheckperf -f host_file_nic1 -r N -d /tmp > subnet1.out

$ gpcheckperf -f host_file_nic2 -r N -d /tmp > subnet2.out

$ gpcheckperf -f host_file_nic3 -r N -d /tmp > subnet3.out

$ gpcheckperf -f host_file_nic4 -r N -d /tmp > subnet4.out

If you have an odd number of hosts to test, you can either run in serial test mode (-r n) or full matrix test mode (-r M).
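Since each run above redirects its report to a file, a quick way to scan all four result files for problems (assuming, as is typical, that failures are flagged with the word "error" in the output) is:

$ grep -i error subnet*.out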

 

Validating Disk I/O and Memory Bandwidth

To test disk and memory bandwidth performance, run gpcheckperf with the disk and stream test run options (-r ds). The disk test uses the dd command (a standard UNIX utility) to test the sequential throughput performance of a logical disk or file system. The memory test uses the STREAM benchmark program to measure sustainable memory bandwidth. Results are reported in MB per second (MB/s).

To run the disk and stream tests

1.Create a host file that has one host name per segment host (use only one of each host's configured host names). Do not include the master host. For example:

sdw1-1

sdw2-1

sdw3-1

sdw4-1

2.Run the gpcheckperf utility using the host file you just created. Use the -d option to specify the file systems you want to test on each host (you must have write access to these directories). You will want to test your primary and mirror segment data directory locations. For example:

$ gpcheckperf -f seg_host_file -r ds -D -d /data/gpdb_p1 \

-d /data/gpdb_p2 -d /data/gpdb_p3 -d /data/gpdb_p4 \

-d /data/gpdb_m1 -d /data/gpdb_m2 -d /data/gpdb_m3 \

-d /data/gpdb_m4

3.The utility may take a while to perform the tests as it is copying very large files between the hosts. When it is finished you will see the summary results for the Disk Write, Disk Read, and Stream tests.

 

 

 

5. Configuring Localization Settings

 

6. Initializing a Greenplum Database System

Overview

Because a Greenplum Database system is distributed across many machines, the process for initializing the database is different than in PostgreSQL. With a regular PostgreSQL DBMS, you run a utility called initdb which creates the data storage directories, generates the shared catalog tables and configuration files, and creates the template1 database, which is the template used to create other databases.

In a Greenplum Database DBMS, each database instance (the master and all segments) must be initialized across all of the hosts in the system in such a way that they can all work together as a unified DBMS. Greenplum provides its own version of initdb called gpinitsystem, which takes care of initializing the database on the master and on each segment instance, and starting each instance in the correct order.

After the Greenplum Database system has been initialized and started, you can then create and manage databases as you would in a regular PostgreSQL DBMS.

 

Initializing Greenplum Database

These are the high-level tasks for initializing Greenplum Database:

1.Make sure you have completed all of the installation tasks described in Chapter 3, “Configuring your Systems for Greenplum”.

2.(optional) If you want a standby master host, make sure that host is installed and configured before you initialize.

3.Create a host file that contains the host name of each segment host. See “Creating a Host List File” on page 38.

4.Create your Greenplum Database system configuration file. See “Creating the Greenplum Database Configuration File” on page 39.

5.Run the Greenplum Database initialization utility on the master host. See “Running the Initialization Utility” on page 40.

Creating a Host List File

The host list file is used by the gpinitsystem system initialization utility to determine the hosts on which to create the Greenplum Database segment instances. This file has just the host names of the segment hosts. If using a multi-NIC configuration, this file should have all per-interface host names for each segment host, one host name per line.

To create the host list file

Note: If available, you can reuse the seg_hosts_file you created in Chapter 3, “Configuring your Systems for Greenplum”.

 

1.Create a file in any location you like. The examples in this documentation call this file seg_hosts_file. For example:

$ vi seg_hosts_file

2.In this file add the host name of each segment host, one name per line, no extra lines or spaces. If using multiple network interfaces per segment host, you must initialize the array using all configured host names of each segment host (include the host names for all NICs; since there is one segment per NIC, as many segment instances are deployed as there are NICs). For example:

sdw1-1

sdw1-2

sdw1-3

sdw1-4

sdw2-1

sdw2-2

sdw2-3

sdw2-4

sdw3-1

sdw3-2

sdw3-3

sdw3-4

3.Save and close the file.

4.If you created this file as root, make sure to change the ownership to the gpadmin user or group. For example:

# chown gpadmin seg_hosts_file

# chgrp gpadmin seg_hosts_file

5.Note the location where this file resides, as you will need to specify it for the MACHINE_LIST_FILE parameter in the next task, “Creating the Greenplum Database Configuration File” on page 39.

 

Creating the Greenplum Database Configuration File

Your Greenplum Database configuration file tells the gpinitsystem initialization utility how you want to configure your Greenplum Database system. An example configuration file can be found in $GPHOME/docs/cli_help/gp_init_config_example. Also see “Initialization Configuration File Format” on page 72 for a detailed description of each parameter.

To create a gp_init_config file

1.Make a copy of the gp_init_config_example document to use as a starting point. For example:

$ cp $GPHOME/docs/cli_help/gp_init_config_example /home/gpadmin/gp_init_config

2.Open the file you just copied in a text editor. For example:

$ vi gp_init_config

3.Set all of the required parameters according to your environment. See “Initialization Configuration File Format” on page 72 for more information. A Greenplum Database system must contain a master instance and at least two segment instances, even if setting up a demo system on a single host (1 master + 2 segments). When creating the configuration file used by gpinitsystem, make sure you specify the correct number of segment instance data directories per host. Here is an example of the required parameters in the gpdb_init_config file:

ARRAY_NAME="Greenplum"

MACHINE_LIST_FILE=/home/gpadmin/multi_seg_hosts_file

SEG_PREFIX=gpseg

PORT_BASE=50000

declare -a DATA_DIRECTORY=(/gpdata1 /gpdata2 /gpdata3 /gpdata4) #### 4 segments

MASTER_HOSTNAME=mdw1

MASTER_DIRECTORY=/gpmaster

MASTER_PORT=5432

TRUSTED_SHELL=ssh

CHECK_POINT_SEGMENTS=8 ### related to WAL sizing

ENCODING=UNICODE

4.(optional) If you want to deploy mirror segments, set the mirroring parameters according to your environment. See “Initialization Configuration File Format” on page 72 for more information. You can also deploy mirrors later using the gpaddmirrors utility.

5.Save and close the file.

 

Running the Initialization Utility

The gpinitsystem utility will create a Greenplum Database system using the values defined in the configuration file.

To run the initialization utility

1.Run the following command referencing the path and file name of your initialization configuration file (gp_init_config). For example:

$ gpinitsystem -c /home/gpadmin/gp_init_config

If deploying an optional standby master, you would run:

$ gpinitsystem -c /home/gpadmin/gp_init_config -s standby_master_hostname

2.The utility will verify your setup information and make sure it can connect to each host and access the data directories specified in your configuration. If all of the pre-checks are successful, the utility will prompt you to confirm your configuration. For example:

=> Continue with Greenplum creation? y

3.The utility will then begin setup and initialization of the master instance and each segment instance in the system. Each segment instance is set up in parallel. Depending on the number of segments, this process can take a while.

4.At the end of a successful setup, the utility will start your Greenplum Database system. You should see:

=> Greenplum Database instance successfully created.
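As an optional sanity check (not part of the documented procedure), you can connect to the template1 database from the master host once the gpadmin environment is sourced, then quit:

$ psql template1
template1=# \q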

Troubleshooting Initialization Problems

If the utility encounters any errors while setting up an instance, the entire process will fail, and could possibly leave you with a partially created system. Refer to the error messages and logs to determine the cause of the failure and where in the process the failure occurred. Log files are created in ~/gpAdminLogs.

Depending on when the error occurred in the process, you may need to clean up and then try the gpinitsystem utility again. For example, if some segment instances were created and some failed, you may need to stop postgres processes and remove any utility-created data directories from your data storage area(s). A backout script is created to help with this cleanup if necessary.

Using the Backout Script

If the gpinitsystem utility fails, it will create the following backout script if it has left your system in a partially installed state:

~/gpAdminLogs/backout_gpinitsystem_<user>_<timestamp>

You can use this script to clean up a partially created Greenplum Database system. This backout script will remove any utility-created data directories, postgres processes, and log files. After correcting the error that caused gpinitsystem to fail and running the backout script, you should be ready to retry initializing your Greenplum Database array.

The following example shows how to run the backout script:

$ sh backout_gpinitsystem_gpadmin_20071031_121053

 

Setting the Master Data Directory Environment Variable

The Greenplum Database management utilities require that the MASTER_DATA_DIRECTORY environment variable be set. This should point to the directory created by the gpinitsystem utility in the master data directory location.

For example, you could add a line similar to the following to the gpadmin user’s profile file (such as .bashrc):

MASTER_DATA_DIRECTORY=/gpmaster/gp-1

export MASTER_DATA_DIRECTORY

After editing the chosen profile file, source it as the correct user to make the changes active. For example:

$ source ~/.bashrc
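With MASTER_DATA_DIRECTORY set, you can use the gpstate management utility to confirm that the master and all segment instances are running. For example:

$ gpstate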

Next Steps

After your system is up and running, the next steps are:

Allowing Client Connections

Creating Databases and Loading Data

Allowing Client Connections

After a Greenplum Database is first initialized it will only allow local connections to the database from the gpadmin role (or whatever system user ran gpinitsystem). If you would like other users or client machines to be able to connect to Greenplum Database, you must give them access. See the Greenplum Database Administrator Guide for more information.

Creating Databases and Loading Data

After verifying your installation, you may want to begin creating databases and loading data. See the Greenplum Database Administrator Guide for more information about creating databases, schemas, tables, and other database objects in Greenplum Database and loading your data.

 

 

 

Source: ITPUB blog, http://blog.itpub.net/280958/viewspace-673023/