TPC-H数据导入PG数据库

本文介绍TPC-H Benckark装载到PG的方法


0. What is TPCH

TPC- H 主要目的是评价特定查询的决策支持能力,强调服务器在数据挖掘、分析处理方面的能力。查询是决策支持应用的最主要应用之一,数据仓库中的复杂查询可以分成两种类型:一种是预先知道的查询,如定期的业务报表;另一种则是事先未知的查询,称为动态查询(Ad-Hoc Query)。

通俗的讲,TPC-H就是当一家数据库开发商开发了一个新的数据库系统,采用TPC-H作为测试基准,来测试衡量数据库操作系统查询决策支持方面的能力。

The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.
[ TPC-H website]

1. How to generate TPC-H Data

download tpc-h-tool.zip, TPC-H网站下载数据生成工具 dbgen

zhuqingping@ubuntu:~/Study/BenchMark/TPCH$ unzip tpc-h-tool.zip
zhuqingping@ubuntu:~/Study/BenchMark/TPCH$ cd tpch_2_17_0/
zhuqingping@ubuntu:~/Study/BenchMark/TPCH/tpch_2_17_0$ ls
dbgen  dev-tools  ref_data
zhuqingping@ubuntu:~/Study/BenchMark/TPCH/tpch_2_17_0$ cd dbgen/
zhuqingping@ubuntu:~/Study/BenchMark/TPCH/tpch_2_17_0/dbgen$ ls
answers     build.c          dists.dss  dsstypes.h      permute.h      queries    rnd.h         tests     tpch.vcproj
bcd2.c      check_answers    driver.c   HISTORY         PORTING.NOTES  README     rng64.c       text.c    update_release.sh
bcd2.h      column_split.sh  dss.ddl    load_stub.c     print.c        reference  rng64.h       tpcd.h    variants
bm_utils.c  config.h         dss.h      makefile.suite  qgen.c         release.h  shared.h      tpch.dsw  varsub.c
BUGS        dbgen.dsp        dss.ri     permute.c       qgen.vcproj    rnd.c      speed_seed.c  tpch.sln
zhuqingping@ubuntu:~/Study/BenchMark/TPCH/tpch_2_17_0/dbgen$ mv makefile.suite makefile
zhuqingping@ubuntu:~/Study/BenchMark/TPCH/tpch_2_17_0/dbgen$ vi makefile
100 ################
101 ## CHANGE NAME OF ANSI COMPILER HERE
102 ################
103 CC      = gcc
104 # Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)
105 #                                  SQLSERVER, SYBASE, ORACLE, VECTORWISE
106 # Current values for MACHINE are:  ATT, DOS, HP, IBM, ICL, MVS, 
107 #                                  SGI, SUN, U2200, VMS, LINUX, WIN32 
108 # Current values for WORKLOAD are:  TPCH
109 DATABASE=ORACLE 
110 MACHINE = LINUX
111 WORKLOAD = TPCH

zhuqingping@ubuntu:~/Study/BenchMark/TPCH/tpch_2_17_0/dbgen$ make

编译成功后目录下生成TPC-H表定义文件及随机数据生成工具
表定义文件为:dss.ddl(表定义) && dss.ri(表约束)
数据生成工具:dbgen

zhuqingping@ubuntu:~/Study/BenchMark/TPCH/tpch_2_17_0/dbgen$ ./dbgen -h
TPC-H Population Generator (Version 2.17.0 build 0)
Copyright Transaction Processing Performance Council 1994 - 2010
USAGE:
dbgen [-{vf}][-T {pcsoPSOL}]
    [-s <scale>][-C <procs>][-S <step>]
dbgen [-v] [-O m] [-s <scale>] [-U <updates>]

Basic Options
===========================
-C <n> -- separate data set into <n> chunks (requires -S, default: 1)
-f     -- force. Overwrite existing files
-h     -- display this message
-q     -- enable QUIET mode
-s <n> -- set Scale Factor (SF) to  <n> (default: 1) 
-S <n> -- build the <n>th step of the data/update set (used with -C or -U)
-U <n> -- generate <n> update sets
-v     -- enable VERBOSE mode

Advanced Options
===========================
-b <s> -- load distributions for <s> (default: dists.dss)
-d <n> -- split deletes between <n> files (requires -U)
-i <n> -- split inserts between <n> files (requires -U)
-T c   -- generate cutomers ONLY
-T l   -- generate nation/region ONLY
-T L   -- generate lineitem ONLY
-T n   -- generate nation ONLY
-T o   -- generate orders/lineitem ONLY
-T O   -- generate orders ONLY
-T p   -- generate parts/partsupp ONLY
-T P   -- generate parts ONLY
-T r   -- generate region ONLY
-T s   -- generate suppliers ONLY
-T S   -- generate partsupp ONLY

To generate the SF=1 (1GB), validation database population, use:
    dbgen -vf -s 1

To generate updates for a SF=1 (1GB), use:
    dbgen -v -U 1 -s 1
zhuqingping@ubuntu:~/Study/BenchMark/TPCH/tpch_2_17_0/dbgen$ ./dbgen -vf -s 1
TPC-H Population Generator (Version 2.17.0)
Copyright Transaction Processing Performance Council 1994 - 2010
Generating data for suppliers table/
Preloading text ... 100%
done.
Generating data for customers tabledone.
Generating data for orders/lineitem tablesdone.
Generating data for part/partsupplier tablesdone.
Generating data for nation tabledone.
Generating data for region tabledone.
zhuqingping@ubuntu:~/Study/BenchMark/TPCH/tpch_2_17_0/dbgen$ ll -th | grep tbl
-rw-rw-r-- 1 zhuqingping zhuqingping  24M  127 17:32 customer.tbl
-rw-rw-r-- 1 zhuqingping zhuqingping 725M  127 17:32 lineitem.tbl
-rw-rw-r-- 1 zhuqingping zhuqingping 2.2K  127 17:32 nation.tbl
-rw-rw-r-- 1 zhuqingping zhuqingping 164M  127 17:32 orders.tbl
-rw-rw-r-- 1 zhuqingping zhuqingping 114M  127 17:32 partsupp.tbl
-rw-rw-r-- 1 zhuqingping zhuqingping  24M  127 17:32 part.tbl
-rw-rw-r-- 1 zhuqingping zhuqingping  389  127 17:32 region.tbl
-rw-rw-r-- 1 zhuqingping zhuqingping 1.4M  127 17:32 supplier.tbl

2. How To Create TPC-H table in PG(Postgres) and Load Data

dss.ddl中SQL可以在PG中直接执行,创建TPCH的表
dss.ri中的表约束,需要依照PG语法进行相应改造,简单验证TPC-H功能可忽略此步骤

zhuqingping@ubuntu:~/Study/BenchMark/TPCH/tpch_2_17_0/dbgen$ psql -dpostgres
psql (9.6.1)
Type "help" for help.

postgres=# drop schema if exists tpch;
NOTICE:  schema "tpch" does not exist, skipping
DROP SCHEMA
postgres=# create schema tpch;
CREATE SCHEMA
postgres=# set search_path to tpch;
SET
postgres=# \d+
No relations found.
postgres=# \i dss.ddl
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
postgres=# \d+
                        List of relations
 Schema |   Name   | Type  |    Owner    |  Size   | Description 
--------+----------+-------+-------------+---------+-------------
 tpch   | customer | table | zhuqingping | 0 bytes | 
 tpch   | lineitem | table | zhuqingping | 0 bytes | 
 tpch   | nation   | table | zhuqingping | 0 bytes | 
 tpch   | orders   | table | zhuqingping | 0 bytes | 
 tpch   | part     | table | zhuqingping | 0 bytes | 
 tpch   | partsupp | table | zhuqingping | 0 bytes | 
 tpch   | region   | table | zhuqingping | 0 bytes | 
 tpch   | supplier | table | zhuqingping | 0 bytes | 
(8 rows)

Load Data
直接load生成的数据会报错,原因在于dbgen生成的数据在每行的结尾处有分割符,者导致PG无法正确判断字段数,导入报错。

postgres=# copy part from '/home/zhuqingping/Study/BenchMark/TPCH/tpch_2_17_0/dbgen/part.tbl' with(delimiter '|', null '');
ERROR:  extra data after last expected column
CONTEXT:  COPY part, line 1: "1|goldenrod lavender spring chocolate lace|Manufacturer#1|Brand#13|PROMO BURNISHED COPPER|7|JUMBO PK..."

针对本问题,需要对数据进行改造,去掉每一行最后的分隔符,C语言实现如下:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    if(argc < 2)
        printf("wrong use, %d\n", argc);
    FILE *fp1 = fopen(argv[1], "r");
    FILE *fp2 = fopen(argv[2], "w");
    char str_data[1000];
    int len = 0;
    while((fgets(str_data, 1000, fp1) != NULL) && strlen(str_data) > 2)
    {
        len = strlen(str_data);
        str_data[len - 2] = '\0';
        fprintf(fp2, "%s\n", str_data);
    }
    fclose(fp1);
    fclose(fp2);
    return 0;
}

对dbgen生成的数据改造后,就可以正确的导入到表中了。

zhuqingping@ubuntu:~/Study/BenchMark/TPCH/CREATE_TABLE$ cat load_data.sql 
-- load data
delete from tpch.customer;
copy tpch.customer from '/home/zhuqingping/Study/BenchMark/TPCH/DATA/customer.tbl.bak' with(delimiter '|', null '');

delete from tpch.lineitem;
copy tpch.lineitem from '/home/zhuqingping/Study/BenchMark/TPCH/DATA/lineitem.tbl.bak' with(delimiter '|', null '');

delete from tpch.nation;
copy tpch.nation from '/home/zhuqingping/Study/BenchMark/TPCH/DATA/nation.tbl.bak' with(delimiter '|', null '');

delete from tpch.orders;
copy tpch.orders from '/home/zhuqingping/Study/BenchMark/TPCH/DATA/orders.tbl.bak' with(delimiter '|', null '');

delete from tpch.partsupp;
copy tpch.partsupp from '/home/zhuqingping/Study/BenchMark/TPCH/DATA/partsupp.tbl.bak' with(delimiter '|', null '');

delete from tpch.part;
copy tpch.part from '/home/zhuqingping/Study/BenchMark/TPCH/DATA/part.tbl.bak' with(delimiter '|', null '');

delete from tpch.region;
copy tpch.region from '/home/zhuqingping/Study/BenchMark/TPCH/DATA/region.tbl.bak' with(delimiter '|', null '');

delete from tpch.supplier;
copy tpch.supplier from '/home/zhuqingping/Study/BenchMark/TPCH/DATA/supplier.tbl.bak' with(delimiter '|', null '');

zhuqingping@ubuntu:~/Study/BenchMark/TPCH/CREATE_TABLE$ psql -dpostgres -f load_data.sql 
DELETE 0
COPY 150000
DELETE 0
COPY 6001215
DELETE 0
COPY 25
DELETE 0
COPY 1500000
DELETE 0
COPY 800000
DELETE 0
COPY 200000
DELETE 0
COPY 5
DELETE 0
COPY 10000

zhuqingping@ubuntu:~/Study/BenchMark/TPCH/CREATE_TABLE$ psql -dpostgres
psql (9.6.1)
Type "help" for help.

postgres=# set search_path to tpch;
SET
postgres=# \d+
                         List of relations
 Schema |   Name   | Type  |    Owner    |    Size    | Description 
--------+----------+-------+-------------+------------+-------------
 tpch   | customer | table | zhuqingping | 28 MB      | 
 tpch   | lineitem | table | zhuqingping | 868 MB     | 
 tpch   | nation   | table | zhuqingping | 8192 bytes | 
 tpch   | orders   | table | zhuqingping | 201 MB     | 
 tpch   | part     | table | zhuqingping | 32 MB      | 
 tpch   | partsupp | table | zhuqingping | 135 MB     | 
 tpch   | region   | table | zhuqingping | 8192 bytes | 
 tpch   | supplier | table | zhuqingping | 1784 kB    | 
(8 rows)

数据导入成功,analyze之后就可以进行查询分析了。


©️2020 CSDN 皮肤主题: 大白 设计师: CSDN官方博客 返回首页
实付0元
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、C币套餐、付费专栏及课程。

余额充值