Generating test data with TPC-DS

白帽的帽子

### Original article: http://www.innovation-brigade.com/index.php?module=Content&type=user&func=display&tid=1&pid=3&lang=en

If you ever find yourself needing to generate massive quantities of benchmark data to test your database's data-import or query performance, the TPC (Transaction Processing Performance Council) provides a handy tool which can easily generate gigabytes of data. Yes, the data it generates and the queries it provides are geared towards decision support applications, but that doesn't prevent them from being a good testing ground for your database, especially if you wish to compare performance across several database platforms.

While the TPC provides a whole range of benchmark suites for various purposes, the TPC-DS benchmark is probably the easiest to implement and use. Best of all, it's free (although you do have to submit your details in a download request), and on a modern Linux box it compiles out of the box without any hacking. On Mac OS X it's not quite as easy: the build produces compile errors which you have to fix manually. So for the purposes of using the TPC-DS benchmark, do yourself a favour and use a Linux box to generate the data.

Here's what you need to do in order to set up the TPC-DS benchmark (on a Linux box):

- Download the DSGen utility (duh!)
- Extract the downloaded archive
- cd TPC-DS
- cd tools
- make (ignore the compile warnings, just ensure the build process completes successfully)

OK, you've now built the utilities required to use the benchmark. At this point it's probably useful to download and read the provided documentation in order to better understand the scope of what the benchmark provides, but here's a quick rundown of how to generate the test data.

./dsdgen

will simply generate the test data (written to |-delimited text files with the extension .dat) at the default scale factor, which is 1. Each unit of scale corresponds to roughly 1 GB of data, so, for example, the command

./dsdgen -scale 5 -force

will generate 5 GB of data and the -force option will overwrite previously generated data. Without the -force option, dsdgen will refuse to overwrite existing test data and simply do nothing.
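
When the generation step is scripted, it can help to assemble the dsdgen arguments in one place. The following is a minimal Python sketch and not part of the TPC-DS kit: the helper names dsdgen_command and generate are hypothetical, only the -scale, -force and -dir options shown in the tool's own help output are used, and the script is assumed to run from the tools directory where ./dsdgen was built.

```python
import subprocess
from pathlib import Path


def dsdgen_command(scale=1, out_dir=None, force=False, binary="./dsdgen"):
    """Build the argument list for one dsdgen run (hypothetical helper)."""
    if scale < 1:
        raise ValueError("scale must be a positive number of GB")
    cmd = [binary, "-scale", str(scale)]
    if out_dir is not None:
        cmd += ["-dir", str(out_dir)]  # DIR: generate tables in this directory
    if force:
        cmd += ["-force"]              # FORCE: overwrite existing data files
    return cmd


def generate(scale=1, out_dir=None, force=False):
    """Run dsdgen and list the .dat files it left behind.

    Assumes ./dsdgen has already been built with make.
    """
    target = Path(out_dir) if out_dir is not None else Path(".")
    target.mkdir(parents=True, exist_ok=True)
    subprocess.run(dsdgen_command(scale, out_dir, force), check=True)
    return sorted(target.glob("*.dat"))


if __name__ == "__main__":
    # Show the command line that corresponds to the example above.
    print(" ".join(dsdgen_command(scale=5, force=True)))
```

Keeping the command construction separate from the call to subprocess.run makes it possible to inspect or log the exact command line before any data is written.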

Now that you have your test data ready, you can load it into your database. For MySQL this roughly involves the following, starting with creating the tables from the tpcds.sql schema file:

mysql -u <your_mysql_user> -p < tpcds.sql

Then, for each *.dat file which was generated, do the following (see the MySQL LOAD DATA documentation for details):

LOAD DATA INFILE 'your_DAT_filename'
INTO TABLE table_the_DAT_file_is_for
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n';

to load the data. See this previous blog post on how to set up MySQL with InnoDB tables for good import performance, so you don't spend lots of time waiting just because your database writes to disk after every insert.

So this is your quick'n'dirty guide on how to use TPC-DS to generate and load lots of test data. Hopefully someone out there finds this useful. Enjoy.
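
Typing one LOAD DATA statement per file gets tedious, so the statements can be generated instead. The sketch below is only an illustration under stated assumptions: the helper name load_statements is hypothetical, and it assumes each file is named after the table it belongs to (for example store_sales.dat for the table store_sales), which should be checked against the files dsdgen actually produced.

```python
from pathlib import Path


def load_statements(data_dir):
    """Return one LOAD DATA statement per .dat file found in data_dir."""
    statements = []
    for dat in sorted(Path(data_dir).glob("*.dat")):
        # Assumption: the file name without its extension is the table name.
        table = dat.stem
        statements.append(
            f"LOAD DATA INFILE '{dat.resolve().as_posix()}'\n"
            f"INTO TABLE {table}\n"
            "FIELDS TERMINATED BY '|'\n"
            "LINES TERMINATED BY '\\n';"
        )
    return statements


if __name__ == "__main__":
    # Print the statements for the current directory so they can be
    # reviewed or redirected into a .sql file.
    print("\n\n".join(load_statements(".")))
```

The output can be saved to a file and fed to the mysql client in the same way as tpcds.sql. Note that LOAD DATA INFILE reads the file on the database server, so the paths must be readable from there.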

Related parameters:

[root@ht-hadoop3 tools]# ./dsdgen --help
ERROR: option '-help' or its argument unknown.
DBGEN2 Population Generator (Version 2.0.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2015

USAGE: DBGEN2 [options]

Note: When defined in a parameter file (using -p), parmeters should
use the form below. Each option can also be set from the command
line, using a form of '-param [optional argument]'
Unique anchored substrings of options are also recognized, and
case is ignored, so '-sc' is equivalent to '-SCALE'

General Options
===============
ABREVIATION = <s> -- build table with abreviation <s>
DIR = <s> -- generate tables in directory <s>
HELP = <n> -- display this message
PARAMS = <s> -- read parameters from file <s>
QUIET = [Y|N] -- disable all output to stdout/stderr
SCALE = <n> -- volume of data to generate in GB
TABLE = <s> -- build only table <s>
UPDATE = <n> -- generate update data set <n>
VERBOSE = [Y|N] -- enable verbose output
PARALLEL = <n> -- build data in <n> separate chunks
CHILD = <n> -- generate <n>th chunk of the parallelized data
RELEASE = [Y|N] -- display the release information
_FILTER = [Y|N] -- output data to stdout
VALIDATE = [Y|N] -- produce rows for data validation

Advanced Options
===============
DELIMITER = <s> -- use <s> as output field separator
DISTRIBUTIONS = <s> -- read distributions from file <s>
FORCE = [Y|N] -- over-write data files without prompting
SUFFIX = <s> -- use <s> as output file suffix
TERMINATE = [Y|N] -- end each record with a field delimiter
VCOUNT = <n> -- set number of validation rows to be produced
VSUFFIX = <s> -- set file suffix for data validation
RNGSEED = <n> -- set RNG seed
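
The PARALLEL and CHILD options in the help output above suggest a way to split a large scale factor across several processes. The following Python sketch shows one way that could be wired up. The helper names chunk_commands and run_parallel are hypothetical, and the numbering of chunks from 1 is an assumption that should be verified against the TPC-DS documentation before relying on it.

```python
import subprocess


def chunk_commands(scale, chunks, out_dir=".", binary="./dsdgen"):
    """Return one dsdgen command per chunk (hypothetical helper)."""
    if chunks < 2:
        raise ValueError("use at least 2 chunks for a parallel run")
    return [
        [binary, "-scale", str(scale), "-dir", str(out_dir),
         "-parallel", str(chunks), "-child", str(child)]
        for child in range(1, chunks + 1)  # assumption: chunks are numbered from 1
    ]


def run_parallel(scale, chunks, out_dir="."):
    """Start every chunk process at once, wait for all, return exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in chunk_commands(scale, chunks, out_dir)]
    return [p.wait() for p in procs]
```

For example, run_parallel(100, 4, "/data/tpcds") would start four dsdgen processes that each generate a quarter of a 100 GB data set; the directory name is only a placeholder.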
