Generating test data with TPC-DS

### Original article: http://www.innovation-brigade.com/index.php?module=Content&type=user&func=display&tid=1&pid=3&lang=en


If you ever need to generate massive quantities of benchmark data to test your database's data-import or query performance, the TPC (Transaction Processing Performance Council) provides a handy tool that can easily generate gigabytes of data. Yes, the data it generates and the queries it provides are geared towards decision support applications, but that doesn't prevent them from being a good testing ground for your database, especially if you wish to compare performance across several database platforms.


While the TPC provides a whole range of benchmark suites for various purposes, the TPC-DS benchmark is probably the easiest to implement and use. Best of all, it's free (though you do have to submit your details in a download request), and on a modern Linux box it compiles out of the box without any hacking. On Mac OS X it's not quite as easy: the build produces compile errors which you have to fix manually. So for the purposes of using the TPC-DS benchmark, do yourself a favour and use a Linux box to generate the data.


Here's what you need to do in order to set up the TPC-DS benchmark (on a Linux box):


- Download the DSGen utility (duh!)
- Extract the downloaded archive
- cd TPC-DS
- cd tools
- make (ignore the compile warnings, just ensure the build process completes successfully)
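
The build steps above can be wrapped in a small shell function. This is only a sketch: the function name build_tpcds_tools and the default directory ./TPC-DS are assumptions for illustration, and the archive is assumed to be extracted already.

```shell
# Build the TPC-DS tools inside an already extracted archive.
# $1: path to the extracted top-level directory (assumed default: ./TPC-DS)
build_tpcds_tools() {
  tpcds_root=${1:-./TPC-DS}
  if [ ! -d "$tpcds_root/tools" ]; then
    echo "tools directory not found under $tpcds_root" >&2
    return 1
  fi
  # Run make in a subshell so the caller's working directory is unchanged.
  ( cd "$tpcds_root/tools" && make ) || return 1
  # Succeed only if the build left an executable dsdgen behind.
  [ -x "$tpcds_root/tools/dsdgen" ]
}
```

Calling build_tpcds_tools with no argument builds in ./TPC-DS/tools; a non-zero return status means the directory was missing or the build did not produce dsdgen.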

OK, you've now built the utilities required to use the benchmark. At this point it's probably useful to download and read the provided documentation in order to better understand the scope of what the benchmark provides, but here's a quick rundown of how to generate the test data.


./dsdgen

will simply generate the test data as |-delimited text files with the extension .dat, at the default scale factor of 1. The scale factor is roughly the volume of data in GB, so, for example, the command


./dsdgen -scale 5 -force

will generate roughly 5 GB of data, and the -force option will overwrite previously generated data. Without the -force option, dsdgen will refuse to overwrite existing test data and simply do nothing.
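
For larger scale factors, the -dir, -parallel and -child options listed in the dsdgen help output can split the work into chunks. The sketch below only prints the command lines rather than running them; the function name dsdgen_chunk_commands and the example path are assumptions, and the output directory is assumed to exist already.

```shell
# Print one dsdgen command line per chunk of a parallel run.
# $1: scale factor (GB), $2: number of chunks, $3: output directory
dsdgen_chunk_commands() {
  scale=$1
  chunks=$2
  out_dir=$3
  child=1
  while [ "$child" -le "$chunks" ]; do
    printf './dsdgen -scale %s -dir %s -parallel %s -child %s -force\n' \
      "$scale" "$out_dir" "$chunks" "$child"
    child=$((child + 1))
  done
}

# Example (hypothetical path): dsdgen_chunk_commands 5 4 /data/tpcds
```

Each printed line generates one chunk, so the lines can be run one after another from the tools directory or started separately to use several cores.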


Now that you have your test data ready, you can load it into your database. For MySQL this roughly involves the following:


mysql -u <your_mysql_user> -p < tpcds.sql

Then, for each *.dat file that was generated, do the following (see this page for details):


LOAD DATA INFILE 'your_DAT_filename'
INTO TABLE table_the_DAT_file_is_for
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n';

to load the data. See this previous blog post on how to set up MySQL with InnoDB tables for good import performance, so you don't spend lots of time waiting just because your database writes to disk after every insert.

So this is your quick'n'dirty guide on how to use TPC-DS to generate and load lots of test data. Hopefully someone out there finds this useful. Enjoy.
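
The per-file LOAD DATA step above lends itself to scripting. The following is a minimal sketch, not part of the original post: a hypothetical shell function, generate_load_statements, prints one LOAD DATA statement per *.dat file in a directory. It assumes each file is named after its table (for example store_sales.dat for store_sales), which may not hold for chunked output, and that the files sit at a path the MySQL server itself can read, since LOAD DATA INFILE without LOCAL reads the file on the server host.

```shell
# Print a LOAD DATA statement for every .dat file in a directory.
# $1: directory containing the files generated by dsdgen
generate_load_statements() {
  data_dir=$1
  for dat_file in "$data_dir"/*.dat; do
    # Skip the unexpanded pattern when the directory has no .dat files.
    [ -e "$dat_file" ] || continue
    # Assumption: the table name is the file name without the .dat suffix.
    table_name=$(basename "$dat_file" .dat)
    # The final '\n' argument is passed through %s, so it is printed literally.
    printf "LOAD DATA INFILE '%s' INTO TABLE %s FIELDS TERMINATED BY '|' LINES TERMINATED BY '%s';\n" \
      "$dat_file" "$table_name" '\n'
  done
}

# Example (placeholders, not real names):
# generate_load_statements /data/tpcds | mysql -u <your_mysql_user> -p <your_database>
```

Printing the statements instead of executing them makes it easy to inspect the SQL first, or to save it to a file and edit table names where the naming assumption does not hold.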


Related parameters:


[root@ht-hadoop3 tools]# ./dsdgen --help
ERROR: option '-help' or its argument unknown.
DBGEN2 Population Generator (Version 2.0.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2015




USAGE: DBGEN2 [options]


Note: When defined in a parameter file (using -p), parmeters should
use the form below. Each option can also be set from the command
line, using a form of '-param [optional argument]'
Unique anchored substrings of options are also recognized, and 
case is ignored, so '-sc' is equivalent to '-SCALE'


General Options
===============
ABREVIATION =  <s>       -- build table with abreviation <s>
DIR =  <s>               -- generate tables in directory <s>
HELP =  <n>              -- display this message
PARAMS =  <s>            -- read parameters from file <s>
QUIET =  [Y|N]           -- disable all output to stdout/stderr
SCALE =  <n>             -- volume of data to generate in GB
TABLE =  <s>             -- build only table <s>
UPDATE =  <n>            -- generate update data set <n>
VERBOSE =  [Y|N]         -- enable verbose output
PARALLEL =  <n>          -- build data in <n> separate chunks
CHILD =  <n>             -- generate <n>th chunk of the parallelized data
RELEASE =  [Y|N]         -- display the release information
_FILTER =  [Y|N]         -- output data to stdout
VALIDATE =  [Y|N]        -- produce rows for data validation


Advanced Options
===============
DELIMITER =  <s>         -- use <s> as output field separator
DISTRIBUTIONS =  <s>     -- read distributions from file <s>
FORCE =  [Y|N]           -- over-write data files without prompting
SUFFIX =  <s>            -- use <s> as output file suffix
TERMINATE =  [Y|N]       -- end each record with a field delimiter
VCOUNT =  <n>            -- set number of validation rows to be produced
VSUFFIX =  <s>           -- set file suffix for data validation
RNGSEED =  <n>           -- set RNG seed
