GPDB Features in Practice

date: 2020-01-11 15:51:39

A while back my advisor asked me to look into the Greenplum database. I installed and used it, but it didn't feel any different from other databases, so I let it drop. Now I'm going back over GPDB's features and trying them out for real.

In fact, the features listed on the official product page are exactly what I need to understand.

Features

First, what are GPDB's features, and where do I find them? The product's landing page is the obvious place.

The Headline Feature

GPDB's headline feature is MPP, i.e. Massively Parallel Processing. The two most prominent items on the home page are:

  • Massively Parallel
  • Analytics

Scrolling down the page, two more prominent features appear:

  • Power at scale: High performance on petabyte-scale data volumes.
  • True Flexibility: Deploy anywhere.

Main Features

Scrolling further down, the home page spells out GPDB's main features:

  • MPP Architecture
  • Petabyte-Scale Loading: loading speed grows with each node added, exceeding 10 TB per hour per rack (roughly 2.8 GB/s).
  • Innovative Query Optimization: the industry's first cost-based query optimizer for big-data workloads.
  • Polymorphic Data Storage: full control over the configuration of table and partition storage, execution, and compression (see the sketch after this list).
  • Integrated In-Database Analytics: a library provided by Apache MADlib for scalable in-database analytics, extending Greenplum's SQL capabilities through user-defined functions.
  • Federated Data Access: query external data sources with the Greenplum optimizer and query processing engine, covering Hadoop, cloud storage, ORC, AVRO, Parquet, and other polyglot data stores.

Unfortunately, none of these icons on the home page are clickable, so the only way to experience the features is to explore them myself.
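To make "polymorphic data storage" concrete, here is a minimal sketch of my own (the table and column names are illustrative, not from the product page): an append-optimized, column-oriented, compressed table, range-partitioned by month.

CREATE TABLE sales (
    id      bigint,
    sale_dt date,
    amount  numeric
)
-- storage choices: append-optimized, columnar, zlib-compressed
WITH (appendonly=true, orientation=column, compresstype=zlib)
DISTRIBUTED BY (id)
-- partitioning choice: one partition per month of 2019
PARTITION BY RANGE (sale_dt)
(
    START (date '2019-01-01') INCLUSIVE
    END   (date '2020-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);

Each partition could even use different storage or compression settings, which is what "full control" appears to mean here.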

TODO

  1. Query data from Hadoop (a sketch appears under Federated Data Access below).

GPDB vs. Hadoop + Hive

(Figure: GreenPlum vs. Hadoop comparison)
Source: 《Greenplum介绍》, a presentation by Alibaba dated 2011-02-17; the GPDB version covered was 4.x.

Compared with Hadoop + Hive, GPDB's query performance beats Hive, but GPDB supports far fewer cluster nodes: at most about 1000 segments, whereas Hive can scale to tens of thousands of nodes.

Data Loading

There are three main loading methods:

  • The SQL INSERT statement: inefficient for loading large amounts of data; suited to small data sets.
  • The COPY command: lets you customize the format of the text file so it can be parsed into columns and rows. Faster than INSERT, but not a parallel process (a small sketch of both follows the figure below).
  • gpfdist and gpload: efficiently dump external data into database tables with fast, parallel loading. The administrator can enable single row error isolation mode to keep loading the well-formed rows. gpload requires a YAML-formatted control file, written in advance, that describes the source data location, format, required transformations, participating hosts, database destinations, and so on; this lets you run a complex loading task.
Figure 1. External Tables Using Greenplum Parallel File Server (gpfdist)
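Before turning to gpfdist, a minimal sketch of the first two methods (the table and file names are illustrative, not from this post):

-- Single-row INSERT: fine for trickles, too slow for bulk loads.
INSERT INTO expenses VALUES ('coffee', '2020-01-11', 3.5, 'food', '');

-- COPY: parses a delimited file through the master; faster than
-- INSERT, but a single non-parallel process.
COPY expenses FROM '/data/expenses.txt' DELIMITER '|' NULL ' ';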

The following demonstrates only the use of gpfdist and gpload:

gpfdist

gpfdist [-d directory] [-p http_port] [-l log_file] [-t timeout] 
   [-S] [-w time] [-v | -V] [-s] [-m max_length]
   [--ssl certificate_path [--sslclean wait_time] ]
   [-c config.yml]

Examples

The following uses the gpfdist protocol to create a readable external table, ext_expenses, from all files with the txt extension. The column delimiter is a pipe (|) and NULL is a space (' '). Access to the external table is in single row error isolation mode. If the error count on a segment is greater than five (the SEGMENT REJECT LIMIT value), the entire external table operation fails and no rows are processed.

=# CREATE EXTERNAL TABLE ext_expenses ( name text, 
   date date, amount float4, category text, desc1 text ) 
   LOCATION ('gpfdist://etlhost-1:8081/*.txt', 
             'gpfdist://etlhost-2:8082/*.txt')
   FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
   LOG ERRORS SEGMENT REJECT LIMIT 5;

To create the readable ext_expenses table from CSV-formatted text files:

=# CREATE EXTERNAL TABLE ext_expenses ( name text, 
   date date,  amount float4, category text, desc1 text ) 
   LOCATION ('gpfdist://etlhost-1:8081/*.txt', 
             'gpfdist://etlhost-2:8082/*.txt')
   FORMAT 'CSV' ( DELIMITER ',' )
   LOG ERRORS SEGMENT REJECT LIMIT 5;
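When LOG ERRORS is used, the rows rejected under single row error isolation can be inspected afterwards with Greenplum's built-in gp_read_error_log() function; a sketch using the table from the example above:

-- List the rejected rows and the reason each one was rejected:
SELECT gp_read_error_log('ext_expenses');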

Below is my own experiment:

  1. Create a data directory, ~/Datasets/baike, and move the dataset baike_triples.txt into it.

    The dataset comes from CN-DBpedia and contains 9 million+ encyclopedia entities and 67 million+ triples: 1.1 million+ mention2entity records, 4 million+ abstracts, 19.8 million+ tags, and 41 million+ infobox entries. It is about 4.0 GB. Sample records:

    "1+8"时代广场   中文名  "1+8"时代广场
    "1+8"时代广场   地点    咸宁大道与银泉大道交叉口
    "1+8"时代广场   实质    城市综合体项目
    "1+8"时代广场   总建面  约11.28万方
    "1.4"河南兰考火灾事故   中文名  "1.4"河南兰考火灾事故
    "1.4"河南兰考火灾事故   地点    河南<a>兰考县</a>城关镇
    "1.4"河南兰考火灾事故   时间    2013年1月4日
    
  2. Start gpfdist in the background:

    gpfdist -d ~/Datasets/baike -p 8081 > /tmp/gpfdist.log 2>&1 &
    ps -A | grep gpfdist # find the process ID
    30693 pts/8    00:00:00 gpfdist  # the PID is 30693
    

    Option notes:

    • -d directory: the directory from which gpfdist serves files for readable external tables, or creates output files for writable external tables. Defaults to the current directory if not specified.
    • -p http_port: the HTTP port on which gpfdist serves files. Defaults to 8080.

    Check the log:

    gt@vm1:~$ more /tmp/gpfdist.log 
    2020-01-13 02:49:38 30693 INFO Before opening listening sockets - following listening sockets are available:
    2020-01-13 02:49:38 30693 INFO IPV6 socket: [::]:8081
    2020-01-13 02:49:38 30693 INFO IPV4 socket: 0.0.0.0:8081
    2020-01-13 02:49:38 30693 INFO Trying to open listening socket:
    2020-01-13 02:49:38 30693 INFO IPV6 socket: [::]:8081
    2020-01-13 02:49:38 30693 INFO Opening listening socket succeeded
    2020-01-13 02:49:38 30693 INFO Trying to open listening socket:
    2020-01-13 02:49:38 30693 INFO IPV4 socket: 0.0.0.0:8081
    Serving HTTP on port 8081, directory /home/gt/Datasets/baike
    
  3. As gpadmin, open a psql session and create the tables: the external table ext_baike for reading the loaded data, and the internal table tb_baike for storing it; malformed rows are captured via LOG ERRORS.

psql -h localhost -d db_kg  # connect to database db_kg

-- Create the external table
CREATE EXTERNAL TABLE ext_baike (
    head text, rel text, tail text)
LOCATION ('gpfdist://vm1:8081/baike_triples.txt')
FORMAT 'TEXT' (DELIMITER E'\t')
LOG ERRORS SEGMENT REJECT LIMIT 50000;

-- Create the internal storage table
CREATE TABLE tb_baike (
    id SERIAL PRIMARY KEY, head text, rel text, tail text);

Detailed syntax for creating external tables: CREATE EXTERNAL TABLE.

Once the external table is created, you can query it directly, for example:

db_kg=# select * from ext_baike limit 10;
      head       |   rel    |     tail      
-----------------+----------+---------------
 **            | 中文名   | **
 **            | 作者     | Amarantine
 **            | 小说进度 | 暂停
 **            | 连载网站 | 晋江文学城
 *西方犯罪学概论 | BaiduTAG | 书籍
 *西方犯罪学概论 | ISBN     | 9787811399967
 *西方犯罪学概论 | 作者     | 李明琪 编
 *西方犯罪学概论 | 出版时间 | 2010-4
 *西方犯罪学概论 | 定价     | 22.00
 *西方犯罪学概论 | 页数     | 305
(10 rows)
  4. Load the external table's data into the internal table:

    INSERT INTO tb_baike(head, rel, tail) SELECT * FROM ext_baike;
    

    Because the virtual machine ran out of storage space, the run eventually failed. The log shows:

    more greenplum/data/data1/primary/gpseg0/pg_log/gpdb-2020-01-13_000000.csv
    2020-01-13 04:10:14.623480 UTC,,,p24832,th930718592,,,,0,,,seg0,,,,,"PANIC","53100","could not write to file ""pg_xlog/xlogtemp.24832"": No space left on device",,,,,,,0,,"xlog.c"
    

    start: 15:33:30

    abnormal end: 16:03

    Error:

    db_kg=# INSERT INTO tb_baike(head, rel, tail) SELECT * FROM ext_baike;
    ERROR: gpfdist error: unknown meta type 108 (url_curl.c:1635) (seg0 slice1 127.0.1.1:6000 pid=15880) (url_curl.c:1635)
    CONTEXT: External table ext_baike, file gpfdist://vm1:8081/baike_triples.txt

    Retried with LIMIT 10000 appended to the SELECT:

    start: 16:10:00

    end: 17:12:42

    ERROR: interconnect encountered a network error, please check your network (seg3 slice1 192.168.56.6:6001 pid=11612)
    DETAIL: Failed to send packet (seq 1) to 127.0.1.1:56414 (pid 15913 cid -1) after 3562 retries in 3600 seconds

Cost-based Query Optimizer

When the master receives a SQL statement, it parses it into an execution-plan DAG. The parts of the DAG that need no data exchange are grouped into slices; joins, aggregates, and sorts all involve redistributing data between slices, with a motion operation carrying out the redistribution. The slices are then dispatched to the segments involved. Source: GreenPlum:基于PostgreSQL的分布式关系型数据库

See About Greenplum Query Processing.
(Figure: a query plan divided into slices by motion operations)

slice: To achieve maximum parallelism during query execution, Greenplum divides the work of the query plan into slices. A slice is a portion of the plan that segments can work on independently. A query plan is sliced wherever a motion operation occurs in the plan, with one slice on each side of the motion.

motion: A motion operation involves moving tuples between the segments during query processing. Note that not every query requires a motion. For example, a targeted query plan does not require data to move across the interconnect.

  • A redistribute motion moves tuples between the segments to complete the join.
  • A gather motion is when the segments send results back up to the master for presentation to the client. Because a query plan is always sliced wherever a motion occurs, this plan also has an implicit slice at the very top of the plan (slice 3).
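To see slices and motions in practice, EXPLAIN any join query; a minimal sketch (the table names are hypothetical, not from this post):

-- A join on a non-distribution key typically forces a redistribute
-- motion, plus a gather motion at the top of the plan.
EXPLAIN
SELECT c.name, sum(o.amount)
FROM customers c JOIN orders o ON o.cust_id = c.id
GROUP BY c.name;

-- Typical lines to look for in the plan output:
--   Gather Motion 4:1  (slice3; segments: 4)
--   Redistribute Motion 4:4  (slice1; segments: 4)
-- Each Motion marks a slice boundary in the plan.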

Federated Data Access

(Figure: federated data access)
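The home page names Hadoop among the queryable sources, which is the TODO item above. In Greenplum 6 this goes through the PXF extension; a sketch under assumed paths and profile (I have not run this, and the HDFS path is illustrative):

-- Run once per database: CREATE EXTENSION pxf;
-- Assumes a PXF server is installed and pointed at the Hadoop cluster.
CREATE EXTERNAL TABLE ext_hdfs_sales (id bigint, amount numeric)
LOCATION ('pxf://data/sales?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER ',');

SELECT count(*) FROM ext_hdfs_sales;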

The MADlib Extension

Adding the MADlib extension

Download MADlib 1.16+8 for RHEL 7 from Pivotal Network.

Install it with gppkg:

An error appeared:

gt@vm1:~/madlib-1.16+8-gp6-rhel7-x86_64$ gppkg -i madlib-1.16+8-gp6-rhel7-x86_64.gppkg 
20200113:10:01:15:018921 gppkg:vm1:gt-[INFO]:-Starting gppkg with args: -i madlib-1.16+8-gp6-rhel7-x86_64.gppkg
20200113:10:01:15:018921 gppkg:vm1:gt-[CRITICAL]:-gppkg failed. (Reason='__init__() takes exactly 17 arguments (16 given)') exiting...

In the end I reinstalled the system with CentOS 7. Pivotal Network states which systems MADlib can be installed on: it only provides binary packages for Redhat 6.x and Redhat 7.x, so the Ubuntu 18.04 I had installed earlier could not be used. Red Hat Enterprise Linux is not something I can afford either, so I considered the free Fedora or CentOS. I installed CentOS 7:

[gpadmin@vm1 ~]$ cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

The Greenplum I installed is version 6.3, so the file to download is madlib-1.16+9-gp6-rhel7-x86_64.tar.gz. After extraction:

[gpadmin@vm1 madlib-1.16+9-gp6-rhel7-x86_64]$ ls -la
total 3048
drwxr-xr-x. 2 gpadmin gpadmin     147 Jan 14 16:51 .
drwx------. 7 gpadmin gpadmin    4096 Jan 14 16:51 ..
-rw-r--r--. 1 gpadmin gpadmin 2904455 Jan 10 07:31 madlib-1.16+9-gp6-rhel7-x86_64.gppkg
-rw-r--r--. 1 gpadmin gpadmin  135530 Jan 10 07:31 open_source_license_MADlib_1.16_GA.txt
-rw-r--r--. 1 gpadmin gpadmin   61836 Jan 10 07:31 ReleaseNotes.txt

The madlib-1.16+9-gp6-rhel7-x86_64.gppkg file inside must be installed with Greenplum's gppkg tool. Usage:

${GPHOME}/bin/gppkg [-i <package> | -u <package> | -r <name-version> | -c]
[-d <master_data_directory>] [-a] [-v]

Install:

${GPHOME}/bin/gppkg -i madlib-1.16+9-gp6-rhel7-x86_64.gppkg

Then install the MADlib objects into a database:

$GPHOME/madlib/bin/madpack install -s madlib -p greenplum -c gpadmin@vm1:5432/testdb

The installation log follows (m4 must be installed beforehand: yum install m4):

[gpadmin@vm1 madlib-1.16+9-gp6-rhel7-x86_64]$ $GPHOME/madlib/bin/madpack install -p greenplum
madpack.py: INFO : Detected Greenplum DB version 6.3.0.
madpack.py: INFO : *** Installing MADlib ***
madpack.py: INFO : MADlib tools version    = 1.16 (/usr/local/greenplum-db-6.3.0/madlib/Versions/1.16/bin/../madpack/madpack.py)
madpack.py: INFO : MADlib database version = None (host=localhost:5432, db=gpadmin, schema=madlib)
madpack.py: INFO : Testing PL/Python environment...
madpack.py: INFO : > PL/Python environment OK (version: 2.7.12)
madpack.py: INFO : > Preparing objects for the following modules:
madpack.py: INFO : > - array_ops
madpack.py: INFO : > - bayes
madpack.py: INFO : > - crf
madpack.py: INFO : > - elastic_net
madpack.py: INFO : > - linalg
madpack.py: INFO : > - pmml
madpack.py: INFO : > - prob
madpack.py: INFO : > - sketch
madpack.py: INFO : > - svec
madpack.py: INFO : > - svm
madpack.py: INFO : > - tsa
madpack.py: INFO : > - stemmer
madpack.py: INFO : > - conjugate_gradient
madpack.py: INFO : > - knn
madpack.py: INFO : > - lda
madpack.py: INFO : > - stats
madpack.py: INFO : > - svec_util
madpack.py: INFO : > - utilities
madpack.py: INFO : > - assoc_rules
madpack.py: INFO : > - convex
madpack.py: INFO : > - deep_learning
madpack.py: INFO : > - glm
madpack.py: INFO : > - graph
madpack.py: INFO : > - linear_systems
madpack.py: INFO : > - recursive_partitioning
madpack.py: INFO : > - regress
madpack.py: INFO : > - sample
madpack.py: INFO : > - summary
madpack.py: INFO : > - kmeans
madpack.py: INFO : > - pca
madpack.py: INFO : > - validation
madpack.py: INFO : Installing MADlib:
madpack.py: INFO : > Created madlib schema
madpack.py: INFO : > Created madlib.MigrationHistory table
madpack.py: INFO : > Wrote version info in MigrationHistory table
madpack.py: INFO : MADlib 1.16 installed successfully in madlib schema.

Using MADlib

An error (in database db_kg):

ERROR: schema "madlib" does not exist

Because no database was specified when installing MADlib with madpack, using MADlib in db_kg produces this error. Check with madpack install-check:

[gpadmin@vm1 ~]$ $GPHOME/madlib/bin/madpack install-check -p greenplum -c gpadmin@vm1:5432/db_madlib_demo
madpack.py: INFO : Detected Greenplum DB version 6.3.0.
madpack.py: INFO : MADlib is not installed in the schema madlib. Install-check stopped.

So MADlib is indeed not installed in that database. Looking back at the madpack install log above, the default database it picked was gpadmin, i.e. the database named after the user:

madpack.py: INFO : MADlib database version = None (host=localhost:5432, db=gpadmin, schema=madlib)

Running install-check against that default database confirms MADlib is installed there:

[gpadmin@vm1 ~]$ $GPHOME/madlib/bin/madpack install-check -p greenplum 
madpack.py: INFO : Detected Greenplum DB version 6.3.0.
TEST CASE RESULT|Module: array_ops|array_ops.ic.sql_in|PASS|Time: 203 milliseconds
TEST CASE RESULT|Module: bayes|bayes.ic.sql_in|PASS|Time: 1102 milliseconds
TEST CASE RESULT|Module: crf|crf_test_small.ic.sql_in|PASS|Time: 931 milliseconds
TEST CASE RESULT|Module: crf|crf_train_small.ic.sql_in|PASS|Time: 927 milliseconds
TEST CASE RESULT|Module: elastic_net|elastic_net.ic.sql_in|PASS|Time: 1041 milliseconds
TEST CASE RESULT|Module: linalg|linalg.ic.sql_in|PASS|Time: 274 milliseconds
TEST CASE RESULT|Module: linalg|matrix_ops.ic.sql_in|PASS|Time: 4158 milliseconds
TEST CASE RESULT|Module: linalg|svd.ic.sql_in|PASS|Time: 2050 milliseconds
TEST CASE RESULT|Module: pmml|pmml.ic.sql_in|PASS|Time: 2597 milliseconds
TEST CASE RESULT|Module: prob|prob.ic.sql_in|PASS|Time: 67 milliseconds
TEST CASE RESULT|Module: svm|svm.ic.sql_in|PASS|Time: 1479 milliseconds
TEST CASE RESULT|Module: tsa|arima.ic.sql_in|PASS|Time: 2058 milliseconds
TEST CASE RESULT|Module: stemmer|porter_stemmer.ic.sql_in|PASS|Time: 107 milliseconds
TEST CASE RESULT|Module: conjugate_gradient|conj_grad.ic.sql_in|PASS|Time: 555 milliseconds
TEST CASE RESULT|Module: knn|knn.ic.sql_in|PASS|Time: 574 milliseconds
TEST CASE RESULT|Module: lda|lda.ic.sql_in|PASS|Time: 641 milliseconds
TEST CASE RESULT|Module: stats|anova_test.ic.sql_in|PASS|Time: 147 milliseconds
TEST CASE RESULT|Module: stats|chi2_test.ic.sql_in|PASS|Time: 174 milliseconds
TEST CASE RESULT|Module: stats|correlation.ic.sql_in|PASS|Time: 426 milliseconds
TEST CASE RESULT|Module: stats|cox_prop_hazards.ic.sql_in|PASS|Time: 586 milliseconds
TEST CASE RESULT|Module: stats|f_test.ic.sql_in|PASS|Time: 129 milliseconds
TEST CASE RESULT|Module: stats|ks_test.ic.sql_in|PASS|Time: 138 milliseconds
TEST CASE RESULT|Module: stats|mw_test.ic.sql_in|PASS|Time: 118 milliseconds
TEST CASE RESULT|Module: stats|pred_metrics.ic.sql_in|PASS|Time: 779 milliseconds
TEST CASE RESULT|Module: stats|robust_and_clustered_variance_coxph.ic.sql_in|PASS|Time: 864 milliseconds
TEST CASE RESULT|Module: stats|t_test.ic.sql_in|PASS|Time: 145 milliseconds
TEST CASE RESULT|Module: stats|wsr_test.ic.sql_in|PASS|Time: 149 milliseconds
TEST CASE RESULT|Module: utilities|encode_categorical.ic.sql_in|PASS|Time: 425 milliseconds
TEST CASE RESULT|Module: utilities|minibatch_preprocessing.ic.sql_in|PASS|Time: 534 milliseconds
TEST CASE RESULT|Module: utilities|path.ic.sql_in|PASS|Time: 387 milliseconds
TEST CASE RESULT|Module: utilities|pivot.ic.sql_in|PASS|Time: 287 milliseconds
TEST CASE RESULT|Module: utilities|sessionize.ic.sql_in|PASS|Time: 245 milliseconds
TEST CASE RESULT|Module: utilities|text_utilities.ic.sql_in|PASS|Time: 334 milliseconds
TEST CASE RESULT|Module: utilities|transform_vec_cols.ic.sql_in|PASS|Time: 427 milliseconds
TEST CASE RESULT|Module: utilities|utilities.ic.sql_in|PASS|Time: 311 milliseconds
TEST CASE RESULT|Module: assoc_rules|assoc_rules.ic.sql_in|PASS|Time: 1328 milliseconds
TEST CASE RESULT|Module: convex|lmf.ic.sql_in|PASS|Time: 409 milliseconds
TEST CASE RESULT|Module: convex|mlp.ic.sql_in|PASS|Time: 2032 milliseconds
TEST CASE RESULT|Module: deep_learning|keras_model_arch_table.ic.sql_in|PASS|Time: 498 milliseconds
TEST CASE RESULT|Module: glm|glm.ic.sql_in|PASS|Time: 4514 milliseconds
TEST CASE RESULT|Module: graph|graph.ic.sql_in|PASS|Time: 4471 milliseconds
TEST CASE RESULT|Module: linear_systems|dense_linear_sytems.ic.sql_in|PASS|Time: 313 milliseconds
TEST CASE RESULT|Module: linear_systems|sparse_linear_sytems.ic.sql_in|PASS|Time: 376 milliseconds
TEST CASE RESULT|Module: recursive_partitioning|decision_tree.ic.sql_in|PASS|Time: 807 milliseconds
TEST CASE RESULT|Module: recursive_partitioning|random_forest.ic.sql_in|PASS|Time: 649 milliseconds
TEST CASE RESULT|Module: regress|clustered.ic.sql_in|PASS|Time: 549 milliseconds
TEST CASE RESULT|Module: regress|linear.ic.sql_in|PASS|Time: 104 milliseconds
TEST CASE RESULT|Module: regress|logistic.ic.sql_in|PASS|Time: 720 milliseconds
TEST CASE RESULT|Module: regress|marginal.ic.sql_in|PASS|Time: 1230 milliseconds
TEST CASE RESULT|Module: regress|multilogistic.ic.sql_in|PASS|Time: 1052 milliseconds
TEST CASE RESULT|Module: regress|robust.ic.sql_in|PASS|Time: 498 milliseconds
TEST CASE RESULT|Module: sample|balance_sample.ic.sql_in|PASS|Time: 389 milliseconds
TEST CASE RESULT|Module: sample|sample.ic.sql_in|PASS|Time: 79 milliseconds
TEST CASE RESULT|Module: sample|stratified_sample.ic.sql_in|PASS|Time: 252 milliseconds
TEST CASE RESULT|Module: sample|train_test_split.ic.sql_in|PASS|Time: 506 milliseconds
TEST CASE RESULT|Module: summary|summary.ic.sql_in|PASS|Time: 457 milliseconds
TEST CASE RESULT|Module: kmeans|kmeans.ic.sql_in|PASS|Time: 2581 milliseconds
TEST CASE RESULT|Module: pca|pca.ic.sql_in|PASS|Time: 4804 milliseconds
TEST CASE RESULT|Module: pca|pca_project.ic.sql_in|PASS|Time: 1948 milliseconds
TEST CASE RESULT|Module: validation|cross_validation.ic.sql_in|PASS|Time: 756 milliseconds

So the fix is simply to install MADlib into db_kg as well, e.g. $GPHOME/madlib/bin/madpack install -s madlib -p greenplum -c gpadmin@vm1:5432/db_kg.

To check from inside a database whether MADlib is installed:

[gpadmin@vm1 ~]$ psql -d db_madlib_demo
psql (9.4.24)
Type "help" for help.

db_madlib_demo=# \dn madlib
 List of schemas
  Name  |  Owner  
--------+---------
 madlib | gpadmin
(1 row)

Now the real content begins…

MADlib's usage is covered in the MADlib Documentation. There is far too much to go through, so I'll pick just a few cases to study.

Array Operations

The operations below are copied from Array Operations:

  • array_add(): Adds two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type.
  • sum(): Aggregate, sums vector element-wisely. It requires that all the values are NON-NULL. Return type is the same as the input type.
  • array_sub(): Subtracts two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type.
  • array_mult(): Element-wise product of two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type.
  • array_div(): Element-wise division of two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type.
  • array_dot(): Dot-product of two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type.
  • array_contains(): Checks whether one array contains the other. This function returns TRUE if each non-zero element in the right array equals to the element with the same index in the left array.
  • array_max(): This function finds the maximum value in the array. NULLs are ignored. Return type is the same as the input type.
  • array_max_index(): This function finds the maximum value and corresponding index in the array. NULLs are ignored. Return type is array in format [max, index], and its element type is the same as the input type.
  • array_min(): This function finds the minimum value in the array. NULLs are ignored. Return type is the same as the input type.
  • array_min_index(): This function finds the minimum value and corresponding index in the array. NULLs are ignored. Return type is array in format [min, index], and its element type is the same as the input type.
  • array_sum(): This function finds the sum of the values in the array. NULLs are ignored. Return type is the same as the input type.
  • array_sum_big(): This function finds the sum of the values in the array. NULLs are ignored. Return type is always FLOAT8 regardless of input. This function is meant to replace array_sum() in cases when a sum may overflow the element type.
  • array_abs_sum(): This function finds the sum of abs of the values in the array. NULLs are ignored. Return type is the same as the input type.
  • array_abs(): This function takes an array as the input and finds abs of each element in the array, returning the resulting array. It requires that all the values are NON-NULL.
  • array_mean(): This function finds the mean of the values in the array. NULLs are ignored.
  • array_stddev(): This function finds the standard deviation of the values in the array. NULLs are ignored.
  • array_of_float(): This function creates an array of set size (the argument value) of FLOAT8, initializing the values to 0.0.
  • array_of_bigint(): This function creates an array of set size (the argument value) of BIGINT, initializing the values to 0.
  • array_fill(): This function sets every value in the array to some desired value (provided as the argument).
  • array_filter(): This function takes an array as the input and keeps only elements that satisfy the operator on the specified scalar. It requires that the array is 1-D and all the values are NON-NULL. Return type is the same as the input type. By default, this function removes all zeros.
  • array_scalar_mult(): This function takes an array as the input and executes element-wise multiplication by the scalar provided as the second argument, returning the resulting array. It requires that all the values are NON-NULL. Return type is the same as the input type.
  • array_scalar_add(): This function takes an array as the input and executes element-wise addition of the scalar provided as the second argument, returning the resulting array. It requires that all the values are NON-NULL. Return type is the same as the input type.
  • array_sqrt(): This function takes an array as the input and finds the square root of each element in the array, returning the resulting array. It requires that all the values are NON-NULL.
  • array_pow(): This function takes an array and a float8 as the input and finds the power of each element in the array, returning the resulting array. It requires that all the values are NON-NULL.
  • array_square(): This function takes an array as the input and finds the square of each element in the array, returning the resulting array. It requires that all the values are NON-NULL.
  • normalize(): This function normalizes an array as sum of squares to be 1. It requires that the array is 1-D and all the values are NON-NULL.
  • array_unnest_2d_to_1d(): This function takes a 2-D array as the input and unnests it by one level. It returns a set of 1-D arrays that correspond to rows of the input array as well as an ID column with values corresponding to row positions occupied by those 1-D arrays within the 2-D array.

Examples:

  1. Create a table and insert data

    CREATE TABLE array_tbl ( id integer NOT NULL PRIMARY KEY,
                             array1 integer[],
                             array2 integer[]
                           );
    INSERT INTO array_tbl VALUES
                           ( 1, '{1,2,3,4,5,6,7,8,9}', '{9,8,7,6,5,4,3,2,1}' ),
                           ( 2, '{1,1,0,1,1,2,3,99,8}','{0,0,0,-5,4,1,1,7,6}' );
    
  2. Use the functions

    db_madlib_demo=# SELECT id,
                            madlib.array_sum(array1),
                            madlib.array_sub(array2, array1),
                            madlib.array_max(array1),
                            madlib.array_min(array1),
                            madlib.array_mean(array1),
                            madlib.normalize(array1)
                     FROM array_tbl GROUP BY id;
    
     id | array_sum |          array_sub          | array_max | array_min |    array_mean    |                                                     
                             normalize                                                                               
    ----+-----------+-----------------------------+-----------+-----------+------------------+-----------------------------------------------------
    -----------------------------------------------------------------------------------------------------------------
      1 |        45 | {8,6,4,2,0,-2,-4,-6,-8}     |         9 |         1 |                5 | {0.0592348877759092,0.118469775551818,0.177704663327
    728,0.236939551103637,0.296174438879546,0.355409326655455,0.414644214431365,0.473879102207274,0.533113989983183}
      2 |       116 | {-1,-1,0,-6,3,-1,-2,-92,-2} |        99 |         0 | 12.8888888888889 | {0.0100595273380576,0.0100595273380576,0,0.010059527
    3380576,0.0100595273380576,0.0201190546761152,0.0301785820141728,0.995893206467704,0.0804762187044609}
    (2 rows)
    
    

Low-Rank Matrix Factorization

This module implements factorization of an incomplete matrix using a low-rank approximation.

Mathematically, this model seeks to find matrices $U$ and $V$ (also referred to as factors) that, for any given incomplete matrix $A$, minimize

$$\|A - UV^T\|_2, \quad \text{s.t.} \quad \mathrm{rank}(UV^T) \le r,$$

where $\|\cdot\|_2$ denotes the Frobenius norm. Let $A$ be an $m \times n$ matrix; then $U$ will be $m \times r$ and $V$ will be $n \times r$ in dimension, with $1 \le r \ll \min(m, n)$. This model is not intended to do the full decomposition, or to be used as part of an inverse procedure. This model has been widely used in recommendation systems (e.g., Netflix [2]) and feature selection (e.g., image processing [3]).
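Once the factors are learned, a missing entry is predicted by the dot product of the corresponding factor rows, which is exactly what the array_dot call in step 5 below computes:

$$\hat{A}_{ij} = u_i^T v_j = \sum_{k=1}^{r} U_{ik} V_{jk}.$$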

Function Syntax

lmf_igd_run( rel_output, # TEXT. The name of the table to receive the output.
             rel_source, # TEXT. The name of the table containing the input data.{sparse}
             col_row,	 # TEXT. The name of the column containing the row number.
             col_column, # TEXT. The name of the column containing the column number.
             col_value,	 # DOUBLE PRECISION. The value at (row, col).
             row_dim,	 # INTEGER, default: "SELECT max(col_row) FROM rel_source".
             column_dim, # INTEGER, default: "SELECT max(col_col) FROM rel_source".
             max_rank,	 # INTEGER, default: 20. The rank of desired approximation.
             stepsize,	 # DOUBLE PRECISION, default: 0.01. Hyper-parameter that decides how aggressive the gradient steps are.
             scale_factor, # DOUBLE PRECISION, default: 0.1. Hyper-parameter that decides scale of initial factors.
             num_iterations, # INTEGER, default: 10. Maximum number of iterations to perform regardless of convergence.
             tolerance # DOUBLE PRECISION, default: 0.0001. Acceptable level of error in convergence.
           )

Examples

  1. Create table, insert data.

    CREATE TABLE lmf_data (row INT, col INT, val FLOAT8, primary key (row, col));
    INSERT INTO lmf_data VALUES (1, 1, 5.0), (3, 100, 1.0), (999, 10000, 2.0);
    
  2. Execute the function

db_madlib_demo=# SELECT madlib.lmf_igd_run('lmf_model', 'lmf_data', 'row', 'col', 'val', 999,10000, 3, 0.1, 2, 10, 1e-9 );
NOTICE:  Matrix lmf_data to be factorized: 999 x 10000
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CONTEXT:  SQL statement "
            CREATE TABLE lmf_model (
                id          SERIAL,
                matrix_u    DOUBLE PRECISION[],
                matrix_v    DOUBLE PRECISION[],
                rmse        DOUBLE PRECISION)"
PL/pgSQL function madlib.lmf_igd_run(character varying,regclass,character varying,character varying,character varying,integer,integer,integer,double precision,double precision,integer,double precision) line 47 at EXECUTE statement
NOTICE:  
Finished low-rank matrix factorization using incremental gradient
DETAIL:  
 * table : lmf_data (row, col, val)
Results:
 * RMSE = 3.61661832699015e-06
Output:
 * view : SELECT * FROM lmf_model WHERE id = 1
 lmf_igd_run 
-------------
           1
(1 row)
  3. Check the result.

    db_madlib_demo=# SELECT array_dims(matrix_u) AS u_dims, array_dims(matrix_v) AS v_dims
    	FROM lmf_model
    	WHERE id=1;
    	
        u_dims    |     v_dims     
    --------------+----------------
     [1:999][1:3] | [1:10000][1:3]
    (1 row)
    
  4. Query the result values

    db_madlib_demo=# SELECT matrix_u[2:2][1:3] AS row_2_features 
    	FROM lmf_model
    	WHERE id = 1;
    
                         row_2_features                      
    ---------------------------------------------------------
     {{1.97037281095982,0.312463999725878,1.06016968935728}}
    (1 row)
    
  5. Make a prediction for a missing entry (row=2, col=7654).

    db_madlib_demo=# SELECT madlib.array_dot(
            matrix_u[2:2][1:3],
            matrix_v[7654:7654][1:3]
    	) AS row_2_col_7654
    		FROM lmf_model
    		WHERE id = 1;
    		
      row_2_col_7654  
    ------------------
     2.37682774869935
    (1 row)
    
    

Singular Value Decomposition

# SVD Function for Dense Matrices
svd( source_table, 			# TEXT. Source table name (dense matrix).
     output_table_prefix,	# TEXT. Prefix for output tables. 
     row_id,				# TEXT. ID for each row.
     k,						# INTEGER. Number of singular values to compute.
     n_iterations,			# INTEGER. Number of iterations to run.
     result_summary_table	# TEXT. The name of the table to store the result summary.
);

# SVD Function for Sparse Matrices
svd_sparse( source_table,
    output_table_prefix,
    row_id,
    col_id,
    value,
    row_dim,
    col_dim,
    k,
    n_iterations,
    result_summary_table
    );
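As a quick trial, the sparse variant can be pointed at the lmf_data table created above; a minimal sketch following the signature listed here (the output prefix and summary table names are my own):

-- Factorize the sparse 999 x 10000 matrix from the LMF example,
-- keeping the top 3 singular values over 10 iterations.
SELECT madlib.svd_sparse('lmf_data', 'svd_out', 'row', 'col', 'val',
                         999, 10000, 3, 10, 'svd_summary');

-- The factors land in tables named after the prefix
-- (assumed naming: svd_out_u, svd_out_s, svd_out_v):
SELECT * FROM svd_out_s;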

Neural Network

Classification Training Function

mlp_classification(
    source_table,
    output_table,
    independent_varname,
    dependent_varname,
    hidden_layer_sizes,
    optimizer_params,
    activation,
    weights,
    warm_start,
    verbose,
    grouping_col
    )

Regression Training Function

mlp_regression(
    source_table,
    output_table,
    independent_varname,
    dependent_varname,
    hidden_layer_sizes,
    optimizer_params,
    activation,
    weights,
    warm_start,
    verbose,
    grouping_col
    )

Optimizer Parameters

  'learning_rate_init = <value>,
   learning_rate_policy = <value>,
   gamma = <value>,
   power = <value>,
   iterations_per_step = <value>,
   n_iterations = <value>,
   n_tries = <value>,
   lambda = <value>,
   tolerance = <value>,
   batch_size = <value>,
   n_epochs = <value>,
   momentum = <value>,
   nesterov = <value>'

Prediction Function

mlp_predict(
    model_table,
    data_table,
    id_col_name,
    output_table,
    pred_type
    )
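Putting the three pieces together, a hypothetical end-to-end run (the iris_data table, its columns, and the parameter values are my own illustration, not from this post):

-- Train a classifier with one hidden layer of 5 units.
SELECT madlib.mlp_classification(
    'iris_data',                                          -- source_table
    'mlp_model',                                          -- output_table
    'ARRAY[sepal_len, sepal_wid, petal_len, petal_wid]',  -- independent_varname
    'class',                                              -- dependent_varname
    ARRAY[5],                                             -- hidden_layer_sizes
    'learning_rate_init=0.003, n_iterations=500',         -- optimizer_params
    'tanh');                                              -- activation

-- Predict back onto the same table; 'response' asks for class labels.
SELECT madlib.mlp_predict('mlp_model', 'iris_data', 'id',
                          'mlp_prediction', 'response');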

There are many more functions that I haven't tried one by one; I'll look them up when I need them.

All in all, this feels to me like an edge use case for a database. Greenplum's core feature is analytics, but MADlib is an Apache project, so can it really count as a Greenplum selling point? Besides, analyzing data inside the database still feels a little off to me. And Greenplum itself isn't that stable; it has quite a few bugs: the bulk load with gpfdist above failed.

Of course, the tool is convenient: the data gets processed right where it is stored.

In short, it's enough to know that it exists.
