PostgreSQL TPC-H testing

I've found several tools that claim to implement a TPC-H-like benchmark, but I've found them unusable for various reasons. For example, DBT-3 is a bit old (last update in 2005) and its dbgen command keeps failing for strange reasons.

The tool from the TPC Council works quite well, and although it does not support PostgreSQL out of the box, it's not very difficult to make it work. You can get it right at the TPC-H page (see the right column, below the TPC-H specification). Download the tgz package (37MB).

compiling

After extracting it, you have to prepare a Makefile - enter the dbgen directory, copy makefile.suite and modify these four variables (around line 109):

CC=gcc
DATABASE=ORACLE
MACHINE=LINUX
WORKLOAD=TPCH

If you're not running Linux, set the MACHINE accordingly. Then just build the tool using make.
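The whole compile step then boils down to something like this (a minimal sketch, assuming the stock layout of the TPC-H package):

$ cd dbgen
$ cp makefile.suite Makefile
# edit the four variables (CC, DATABASE, MACHINE, WORKLOAD) as shown above
$ make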

generating data

So now we have a "dbgen" tool that generates data in a CSV format. Just like pgbench, it has a scaling factor that influences the amount of data generated - the default value (1) means about 1GB of raw data, i.e. about 2GB of data after populating the database. So let's create 10GB of raw data:

$ ./dbgen -s 10

That gives us eight .tbl files in CSV format, each containing data for one table. The problem is that each row contains an extra "|" separator at the end of the line, so PostgreSQL fails to load it. But it's quite easily fixable with sed - just run this:

for i in `ls *.tbl`; do sed 's/|$//' $i > ${i/tbl/csv}; echo $i; done;

Now we have eight CSV files that may be loaded into the database. But we have to create it first.

populating the database

Although the TPC-H specification describes the database structure, the create scripts are not part of the package. I've prepared a create script that creates all the tables, and an alter script that creates the foreign keys (after the database is populated). So create the database and the tables:

$ createdb tpch
$ psql tpch < tpch-create.sql
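In case you'd rather write the create script yourself, the structure follows the TPC-H specification - for illustration, the two smallest tables might look roughly like this (a sketch based on the spec; the actual tpch-create.sql may differ in details):

-- a sketch of two of the eight tables, following the TPC-H specification
CREATE TABLE region (
    r_regionkey  INTEGER NOT NULL,
    r_name       CHAR(25) NOT NULL,
    r_comment    VARCHAR(152)
);

CREATE TABLE nation (
    n_nationkey  INTEGER NOT NULL,
    n_name       CHAR(25) NOT NULL,
    n_regionkey  INTEGER NOT NULL,
    n_comment    VARCHAR(152)
);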

Now it's time to populate the database with the generated data. Don't forget that COPY requires absolute paths, so you may use a script like this:

dir=`pwd`
opts="-h localhost tpch"

psql $opts -c "COPY part FROM '$dir/part.csv' WITH (FORMAT csv, DELIMITER '|')"
psql $opts -c "COPY region FROM '$dir/region.csv' WITH (FORMAT csv, DELIMITER '|')"
psql $opts -c "COPY nation FROM '$dir/nation.csv' WITH (FORMAT csv, DELIMITER '|')"
psql $opts -c "COPY supplier FROM '$dir/supplier.csv' WITH (FORMAT csv, DELIMITER '|')"
psql $opts -c "COPY customer FROM '$dir/customer.csv' WITH (FORMAT csv, DELIMITER '|')"
psql $opts -c "COPY partsupp FROM '$dir/partsupp.csv' WITH (FORMAT csv, DELIMITER '|')"
psql $opts -c "COPY orders FROM '$dir/orders.csv' WITH (FORMAT csv, DELIMITER '|')"
psql $opts -c "COPY lineitem FROM '$dir/lineitem.csv' WITH (FORMAT csv, DELIMITER '|')"
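A quick sanity check after the load is to compare the row counts with the expected sizes (a sketch; at scale factor 10, lineitem should have roughly 60 million rows and orders 15 million):

$ psql tpch -c "SELECT count(*) FROM lineitem"
$ psql tpch -c "SELECT count(*) FROM orders"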

And finally, create the foreign keys:

$ psql tpch < tpch-alter.sql
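For illustration, the alter script contains statements along these lines (a sketch only - the real tpch-alter.sql covers all eight tables, and the keys are added only after the load so the COPY itself stays fast):

-- a sketch; primary keys first, then the foreign keys referencing them
ALTER TABLE region   ADD PRIMARY KEY (r_regionkey);
ALTER TABLE nation   ADD PRIMARY KEY (n_nationkey);
ALTER TABLE nation   ADD FOREIGN KEY (n_regionkey) REFERENCES region (r_regionkey);
ALTER TABLE supplier ADD FOREIGN KEY (s_nationkey) REFERENCES nation (n_nationkey);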

So now we have a database populated with data - let's prepare some queries.

generating queries

TPC-H describes 22 queries (or rather templates), stored in the "queries" directory. Then there is a "qgen" tool that generates queries from the templates (mostly filling them in with random data). We have to fix some minor problems before running it.

First, most of the queries use a ROWCOUNT (or another db-specific variant) instead of LIMIT. This is quite easy to fix. Second, about five queries use correlated aggregate subqueries with huge outer tables - that does not work very well in PostgreSQL, but it's easy to rewrite them as a JOIN. So instead of a query like this

select
    sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,
    part
where
    p_partkey = l_partkey
    and p_brand = ':1'
    and p_container = ':2'
    and l_quantity < (
        select
            0.2 * avg(l_quantity)
        from
            lineitem
        where
            l_partkey = p_partkey
    )
LIMIT 1;

you get a query like this

select
    sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,
    part,
    (SELECT l_partkey AS agg_partkey,
            0.2 * avg(l_quantity) AS avg_quantity
     FROM lineitem GROUP BY l_partkey) part_agg
where
    p_partkey = l_partkey
    and agg_partkey = l_partkey
    and p_brand = ':1'
    and p_container = ':2'
    and l_quantity < avg_quantity
LIMIT 1;

that produces the same result and is much faster (actually the first query never finished for me). I'm not quite sure how strict TPC-H is about the query format, but my goal was not to do a fully compliant TPC-H benchmark - rather to have a good starting point for a DSS benchmark.

The modified queries are available here - just extract it into a new 'queries-pg' directory.

$ mkdir queries-pg
$ tar -xzvf ~/tpch-queries.tgz -C queries-pg

and you're ready to generate the workload. This

$ DSS_QUERY=queries-pg ./qgen > workload.sql

gives you a script that executes one query for each template. If you want to generate a workload with only some of the queries, list the query IDs at the command line like this

$ DSS_QUERY=queries-pg ./qgen 2 3 7 > workload.sql

That will give you a workload with queries 2.sql, 3.sql and 7.sql only. If you want to generate more queries, you may do this

for r in `seq 1 10`
do
    rn=$((`cat /dev/urandom | od -N3 -An -i` % 10000))
    DSS_QUERY=queries-pg ./qgen -r $rn >> workload.sql
done

Notice how qgen is initialized with a different random number each time (by default it uses a timestamp, so executing it multiple times within a second gives exactly the same queries).

running the workload

So we have a populated database and we know how to generate a workload, but how do we run it? I've been thinking about the pgbench "custom scripts" feature, but it would be quite difficult to generate all the various random values.

I'm using a very simple approach - decide how many "clients" I want to run (with a DSS benchmark this tends to be a rather low number, e.g. 4), prepare a corresponding number of workload scripts (e.g. workload-1.sql .. workload-4.sql) and then do something like this

# start the processes
for c in `seq 1 4`
do
    /usr/bin/time -f "total=%e" -o result-$c.log \
        psql tpch < workload-$c.sql > /dev/null 2>&1 &
done;

# wait for the processes
for p in `jobs -p`
do
    wait $p;
done;

This produces a bunch of result logs with the number of seconds for each workload. Obviously there's a lot more you can do - collect data from various tools (iostat, vmstat, ...) and so on.
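For example, a couple of background collectors started before the workload and stopped right after it will do (a minimal sketch; the 5-second sampling interval and log file names are arbitrary):

# a sketch - sample basic CPU, memory and I/O stats during the run
vmstat 5 > vmstat.log &
vmstat_pid=$!
iostat -x 5 > iostat.log &
iostat_pid=$!

# ... start the workload scripts and wait for them as shown above ...

kill $vmstat_pid $iostat_pid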

Keep in mind that those queries have to chew through a lot of data (depending on the scale factor you've chosen), and when something goes wrong (e.g. a bad plan is chosen), it will take forever. That's why I usually create a batch of scripts with one query per file, and then kill a query if it does not finish in a given time frame (e.g. 15 minutes). The timeout command may be used to do that

# prepare queries and execute them with a 60 second timeout
for q in `seq 1 22`
do
    DSS_QUERY=queries-pg ./qgen $q > q-$q.sql
    timeout 60 psql tpch < q-$q.sql
done

The random seed is not needed here anymore, as the queries tend to take more than one second (so the default timestamp-based seed already differs for each qgen invocation).
