I've found several tools that claim to implement a TPC-H-like benchmark, but I found them unusable for various reasons. For example DBT-3 is a bit old (last update in 2005) and the dbgen command kept failing for strange reasons.
The tool from the TPC Council works quite well, and although it does not support PostgreSQL out of the box, it's not very difficult to make it work. You can get it right at the TPC-H page (see the right column, below the TPC-H specification). Download the tgz package (37MB).
compiling
After extracting it, you have to prepare a Makefile - enter the dbgen directory, copy makefile.suite to Makefile and modify these four variables (around line 109):
CC=gcc
DATABASE=ORACLE
MACHINE=LINUX
WORKLOAD=TPCH
If you're not running Linux, set MACHINE accordingly. Then just build the tool using make.
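Put together, the whole build boils down to something like this:
$ cd dbgen
$ cp makefile.suite Makefile
# edit CC, DATABASE, MACHINE and WORKLOAD (see above)
$ make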
generating data
So now we have a "dbgen" tool that generates data in a CSV format. Just like pgbench it has a scaling factor that influences the amount of data generated - the default value (1) means about 1GB of raw data, i.e. about 2GB of data after populating the database. So let's create 10GB of raw data:
$ ./dbgen -s 10
That gives us eight .tbl files in CSV format, each containing data for one table. The problem is that each row contains an extra "|" separator at the end of the line, so PostgreSQL fails to load it. But it's quite easy to fix with sed - just run this:
for i in *.tbl; do sed 's/|$//' $i > ${i/tbl/csv}; echo $i; done;
Now we have eight CSV files that may be loaded into the database. But we have to create the database first.
populating the database
Although the TPC-H specification describes the database structure, the create scripts are not part of the package. I've prepared a create script that creates all the tables, and an alter script that creates the foreign keys (to be run after the database is populated). So create a database and the tables:
$ createdb tpch
$ psql tpch < tpch-create.sql
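The create script just follows the structure from the TPC-H specification - for illustration, the two smallest tables look roughly like this (the exact script may differ in details, the column types come from the spec):
CREATE TABLE region (
    r_regionkey  INTEGER NOT NULL,
    r_name       CHAR(25) NOT NULL,
    r_comment    VARCHAR(152),
    PRIMARY KEY (r_regionkey)
);
CREATE TABLE nation (
    n_nationkey  INTEGER NOT NULL,
    n_name       CHAR(25) NOT NULL,
    n_regionkey  INTEGER NOT NULL,
    n_comment    VARCHAR(152),
    PRIMARY KEY (n_nationkey)
);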
Now it's time to populate the database with the generated data. Don't forget that COPY requires absolute paths, so you may use a script like this:
dir=`pwd`
opts="-h localhost tpch"
psql $opts -c "COPY part FROM '$dir/part.csv' WITH (FORMAT csv, DELIMITER '|')"
psql $opts -c "COPY region FROM '$dir/region.csv' WITH (FORMAT csv, DELIMITER '|')"
psql $opts -c "COPY nation FROM '$dir/nation.csv' WITH (FORMAT csv, DELIMITER '|')"
psql $opts -c "COPY supplier FROM '$dir/supplier.csv' WITH (FORMAT csv, DELIMITER '|')"
psql $opts -c "COPY customer FROM '$dir/customer.csv' WITH (FORMAT csv, DELIMITER '|')"
psql $opts -c "COPY partsupp FROM '$dir/partsupp.csv' WITH (FORMAT csv, DELIMITER '|')"
psql $opts -c "COPY orders FROM '$dir/orders.csv' WITH (FORMAT csv, DELIMITER '|')"
psql $opts -c "COPY lineitem FROM '$dir/lineitem.csv' WITH (FORMAT csv, DELIMITER '|')"
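If the CSV files are not accessible to the server process (e.g. when loading from a remote machine), psql's \copy command is a handy alternative - it reads the files on the client side and doesn't require absolute paths:
psql -h localhost tpch -c "\copy part FROM 'part.csv' WITH (FORMAT csv, DELIMITER '|')"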
and finally create the foreign keys:
$ psql tpch < tpch-alter.sql
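Again, the alter script merely adds the foreign keys defined in the specification, i.e. statements along these lines (and similarly for the remaining tables):
ALTER TABLE nation ADD FOREIGN KEY (n_regionkey) REFERENCES region (r_regionkey);
ALTER TABLE supplier ADD FOREIGN KEY (s_nationkey) REFERENCES nation (n_nationkey);
ALTER TABLE orders ADD FOREIGN KEY (o_custkey) REFERENCES customer (c_custkey);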
Now we have a database populated with data, so let's prepare some queries.
generating queries
TPC-H describes 22 queries (or rather query templates), stored in the "queries" directory. Then there is a "qgen" tool that generates queries from the templates (mostly by filling them with random values). We have to fix some minor problems before running it.
First, most of the queries use ROWCOUNT (or some other db-specific variant) instead of LIMIT. This is quite easy to fix. Second, about five of the queries use correlated aggregate subqueries with huge outer tables - that does not work very well in PostgreSQL, but it's easy to rewrite them as a JOIN. So instead of a query like this
select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part
where
p_partkey = l_partkey
and p_brand = ':1'
and p_container = ':2'
and l_quantity < (
select
0.2 * avg(l_quantity)
from
lineitem
where
l_partkey = p_partkey
)
LIMIT 1;
you get a query like this
select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part,
(SELECT l_partkey AS agg_partkey,
0.2 * avg(l_quantity) AS avg_quantity
FROM lineitem GROUP BY l_partkey) part_agg
where
p_partkey = l_partkey
and agg_partkey = l_partkey
and p_brand = ':1'
and p_container = ':2'
and l_quantity < avg_quantity
LIMIT 1;
that produces the same result and is much faster (actually, the first query never finished for me). I'm not quite sure how strict TPC-H is about the query format, but my goal was not to do a fully compliant TPC-H benchmark - it's rather a good starting point for a DSS benchmark.
The modified queries are available here - just extract them into a new 'queries-pg' directory:
$ mkdir queries-pg
$ tar -xzvf ~/tpch-queries.tgz -C queries-pg
and you're ready to generate the workload. This command
$ DSS_QUERY=queries-pg ./qgen > workload.sql
gives you a script that executes one query for each template. If you want a workload with only some of the queries, list the query IDs on the command line like this
$ DSS_QUERY=queries-pg ./qgen 2 3 7 > workload.sql
That will give you a workload with queries 2.sql, 3.sql and 7.sql only. If you want to generate more queries, you may do this
for r in `seq 1 10`
do
    rn=$((`cat /dev/urandom | od -N3 -An -i` % 10000))
    DSS_QUERY=queries-pg ./qgen -r $rn >> workload.sql
done
Notice how qgen is initialized with a different random seed each time - by default it uses a timestamp, so executing it multiple times within a second gives exactly the same queries.
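By the way, if your shell is bash, the built-in $RANDOM variable (values 0 to 32767) is a simpler way to get a seed:
DSS_QUERY=queries-pg ./qgen -r $RANDOM >> workload.sql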
running the workload
So we have a populated database and we know how to generate a workload, but how do we run it? I've been thinking about the pgbench "custom scripts" feature, but it would be quite difficult to generate all the various random values that way.
I'm using a very simple approach - decide how many "clients" to run (with a DSS benchmark this tends to be quite a low number, e.g. 4), prepare the corresponding number of workload scripts (e.g. workload-1.sql .. workload-4.sql) and then do something like this
# start the processes
for c in `seq 1 4`
do
    /usr/bin/time -f "total=%e" -o result-$c.log \
        psql tpch < workload-$c.sql > /dev/null 2>&1 &
done;

# wait for the processes
for p in `jobs -p`
do
    wait $p;
done;
This produces a bunch of result logs with the number of seconds each workload took. Obviously there are a lot of other things you can do - collect data from various tools (iostat, vmstat, ...) and so on.
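For example, the total runtimes are easy to collect from the logs, thanks to the "total=%e" format used above:
$ grep total= result-*.log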
Keep in mind that those queries have to chew through a lot of data (depending on the scale factor you've chosen), and when something goes wrong (e.g. a bad plan is chosen), they will take forever. That's why I usually create a batch of scripts with one query per file, and then kill each query if it does not finish within a given time frame (e.g. 15 minutes). The timeout command may be used to do that
# prepare the queries and execute them with a 60 second timeout
for q in `seq 1 22`
do
    DSS_QUERY=queries-pg ./qgen $q > q-$q.sql
    timeout 60 psql tpch < q-$q.sql
done
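If you also want to record how long each query took, the same /usr/bin/time trick from above can be combined with the timeout (the result-q$q.log names are just an example):
for q in `seq 1 22`
do
    DSS_QUERY=queries-pg ./qgen $q > q-$q.sql
    # elapsed time is logged even if the query hits the timeout
    /usr/bin/time -f "total=%e" -o result-q$q.log \
        timeout 60 psql tpch < q-$q.sql > /dev/null 2>&1
done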
The explicit random seed is not needed anymore - as each query takes more than a second to run, the default timestamp-based seed differs for each qgen invocation.