基于国产服务器使用hive -testbench-hdp3工具测试hive的TPCDS

ITbreeze

已于 2024-02-04 08:38:28 修改

阅读量1.6k

点赞数 44

分类专栏：大数据国产化文章标签： hive hadoop 数据仓库

于 2024-02-02 16:43:29 首次发布

本文链接：https://blog.csdn.net/ITbreeze/article/details/135996066

版权

国产化同时被 2 个专栏收录

3 篇文章 0 订阅

订阅专栏

大数据

2 篇文章 0 订阅

订阅专栏

本文详细描述了如何在Hadoop环境下，通过Hive-testbench对TPC-DS进行测试，包括测试环境配置（如Ambari版本、HDP安装、数据格式选择），数据生成和查询执行步骤，以及遇到的问题与解决方案。

摘要由CSDN通过智能技术生成

目的

使用hive-testbench测试TPC-DS。

测试环境

1、Ambari2.7.6

2、集群未启用kerberose

3、长城DF729服务器、kylinV10SP1操作系统，建设的虚拟服务器。配置为：16c32g/200G

前提：

Hdp-3.0.0已经安装

Hive-3.1.0已经安装

Hdfs-3.1.0已经安装

Maven，如果未安装在tpcds-build时，自动安装

yum -y install gcc gcc-c++

下载hive -testbench-hdp3.zip ，根据自己安装的环境下载对应的测试套件 GitHub - hortonworks/hive-testbench

部署：

上传安装包，并编译

unzip hive -testbench-hdp3.zip

cd hive -testbench-hdp3

./tpcds-build.sh

生成数据和查询sql。(这里指定的rcfile的格式，/data/tpcds-generate此路径是在HDFS上面的路径，而非本地路径)

export FORMAT=rcfile

bash -x tpcds-setup.sh 2 /data/tpcds-generate

FORMAT=rcfile：代表数据文件格式，export FORMAT=rcfile可以不需要；默认生成是orc文件格式

2：代表生成10G数据

/data/tpcds-generate：代表生成数据的目录

默认数据格式为ORC，想要切换格式可以执行。

cd hive-testbench

FORMAT=textfile ./tpcds-setup.sh

or

cd hive-testbench

FORMAT=parquet ./tpcds-setup.sh 2

验证测试：

生成数据的案例

export FORMAT=rcfile

构建1 TB的TPC-DS数据： ./tpcds-setup.sh 1000

构建1 TB的TPC-H数据： ./tpch-setup.sh 1000

构建100 TB的TPC-DS数据： ./tpcds-setup.sh 100000

构建30 TB文本格式的TPC-DS数据： FORMAT=textfile ./tpcds-setup 30000

构建30 TB RCFile格式的TPC-DS数据： FORMAT=rcfile ./tpcds-setup 30000

测试查询执行

进行sql查询：hive -database tpcds_bin_partitioned_rcfile_2 -f query1.sql。tpcds_bin_partitioned_rcfile_2为相应的数据库，生成数据后会将数据写入数据库中

su - hive

hive

hive> use tpcds_bin_partitioned_rcfile_2;

hive> source query1.sql;

批量执行：

设置参数，此参数在执行TPCDS的时候，已经封装了。故需要先进行设置。

首先将文件进行复制（因为在默认的hive -testbench-hdp3包中，在sample-queries-tpcds目录下无此文件）：

cp sample-queries-tpch/testbench.settings  sample-queries-tpcds/

然而此文件里面的参数需要在hive的配置文件进行修改，并且重启hive组件（通过hive -i testbench.settings执行会报错，提示需要在配置文件中才能修改）。

最后将文件里面的内容清空，为了脚本执行的时候，不会报参数异常。（后续再考虑调整脚本程序，完善此能力不在读文件）

这个脚本有两个参数：suite scale，比如tpcds 2.

perl runSuite.pl tpcds 2  rcfile  hive > /tmp/tpcds.log 2>&1
or
perl  runSuite.pl tpcds 2  rcfile 'beeline  -n hive -u  "jdbc:hive2://hadoop02:2181,hadoop03:2181,hadoop01:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"'

最后的结果文件可以查看tpcds.log

 perl  runSuite.pl tpcds 2  rcfile 'beeline  -n hive -u  "jdbc:hive2://hadoop02:2181,hadoop03:2181,hadoop01:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" '  
params: tpcds, 2, rcfile, beeline  -n hive -u  "jdbc:hive2://hadoop02:2181,hadoop03:2181,hadoop01:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" ;
filename,status,time,rows
query1.sql,success,47,38.281
query10.sql,success,61,52.433
query11.sql,success,74,65.379
query12.sql,success,40,30.5
query13.sql,success,54,44.838
query14.sql,success,333,201.082
query14.sql,success,333,123.27
query15.sql,success,44,34.407
query16.sql,success,100,91.866
query17.sql,success,59,49.796

提升点：

可以修改的更通用的方式：

1、数据库的格式orc，通过修改支持多样格式，通过传参的方式实现。

2、是hive的连接方式，可以设置为beeline等方式，通过传参的方式实现。

3、打印正在执行的cmd。将初始化环境的时间能够计算进来，且使用beeline连接server的耗时更真实、更加贴近实际的环境。修改之后可以用于其他测试，比如spark-sql、impala等；

这个感谢此博主的分享，站在巨人肩膀上面进行了一些修改，包括beeline的连接以及显示的时间和行数等细致的问题。hive 性能验收 hive testbench_mob64ca1402d47a的技术博客_51CTO博客

#!/usr/bin/perl
 
use strict;
use warnings;
use POSIX;
use File::Basename;
 
# PROTOTYPES
sub dieWithUsage(;$);
 
# GLOBALS
my $SCRIPT_NAME = basename( __FILE__ );
my $SCRIPT_PATH = dirname( __FILE__ );
 
# MAIN
dieWithUsage("one or more parameters not defined") unless @ARGV >= 4;
my $suite = shift;
my $scale = shift || 2;
my $format = shift || 3;
my $engineCmd = shift || 4;
dieWithUsage("suite name required") unless $suite eq "tpcds" or $suite eq "tpch";
print "params: $suite, $scale, $format, $engineCmd;";
 
chdir $SCRIPT_PATH;
if( $suite eq 'tpcds' ) {
        chdir "sample-queries-tpcds";
} else {
        chdir 'sample-queries-tpch';
} # end if
my @queries = glob '*.sql';
 
my $db = { 
        'tpcds' => "tpcds_bin_partitioned_${format}_$scale",
        'tpch' => "tpch_flat_${format}_$scale"
};
 
print "filename,status,rows,time\n";
for my $query ( @queries ) {
        my $logname = "$query.log";
#        my $cmd="${engineCmd}/$db->{${suite}} -i conf.settings -f $query 2>&1  | tee $query.log";
        my $cmd="echo 'use $db->{${suite}}; source $query;' | ${engineCmd} -i testbench.settings 2>&1  | tee $query.log";
#       my $cmd="cat $query.log";
        #print $cmd ; exit;
        my $currentTime = strftime("%Y-%m-%d %H:%M:%S", localtime(time));
        print "$currentTime : ";
        #print "$cmd \n";
 
        my $hiveStart = time();
 
        my @hiveoutput=`$cmd`;
        die "${SCRIPT_NAME}:: ERROR:  hive command unexpectedly exited \$? = '$?', \$! = '$!'" if $?;
 
        my $hiveEnd = time();
        my $hiveTime = $hiveEnd - $hiveStart;
        my $is_success = 0;
        foreach my $line ( @hiveoutput ) {
                if( $line =~ /[(\d+|No)]\s+row[s]? selected \(([\d\.]+) seconds\)/ ) {
                        $is_success = 1;
                        print "$query,success,$hiveTime,$1\n"; 
                } # end if
        } # end while
        if( $is_success == 0) {
                print "$query,failed,$hiveTime\n";
        }
} # end for
 
 
sub dieWithUsage(;$) {
        my $err = shift || '';
        if( $err ne '' ) {
                chomp $err;
                $err = "ERROR: $err\n\n";
        } # end if
 
        print STDERR <<USAGE;
${err}Usage:
        perl ${SCRIPT_NAME} [tpcds|tpch] [scale] [format] [engineCmd]
 
Description:
        This script runs the sample queries and outputs a CSV file of the time it took each query to run.  Also, all hive output is kept as a log file named 'queryXX.sql.log' for each query file of the form 'queryXX.sql'. Defaults to scale of 4.
USAGE
        exit 1;
}

beeline方式连接的，修改的testbench.settings 文件信息（仅作参考）:

#set ambari.hive.db.schema.name=hive;
#set fs.file.impl.disable.cache=true;
#set fs.hdfs.impl.disable.cache=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join=true;
set hive.auto.convert.sortmerge.join=true;
#set hive.compactor.abortedtxn.threshold=1000;
#set hive.compactor.check.interval=300;
#set hive.compactor.delta.num.threshold=10;
#set hive.compactor.delta.pct.threshold=0.1f;
#set hive.compactor.initiator.on=false;
#set hive.compactor.worker.threads=0;
#set hive.compactor.worker.timeout=86400;
#set hive.compute.query.using.stats=true;
#set hive.enforce.bucketing=true;
#set hive.enforce.sorting=true;
#set hive.enforce.sortmergebucketmapjoin=true;
#set hive.execution.engine=mr;
set hive.execution.engine=tez;
set hive.limit.pushdown.memory.usage=0.04;
set hive.map.aggr=true;
set hive.mapjoin.bucket.cache.size=10000;
#set hive.mapred.reduce.tasks.speculative.execution=false;
#set hive.metastore.cache.pinobjtypes=Table,Database,Type,FieldSchema,Order;
#set hive.metastore.client.socket.timeout=60;
#set hive.metastore.execute.setugi=true;
#set hive.metastore.warehouse.dir=/apps/hive/warehouse;
#set hive.metastore.warehouse.dir=/warehouse/tablespace/managed/hive;
set hive.optimize.bucketmapjoin.sortedmerge=false;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.index.filter=true;
set hive.optimize.reducededuplication.min.reducer=4;
set hive.optimize.reducededuplication=true;
set hive.orc.splits.include.file.footer=false;
#set hive.security.authorization.enabled=false;
#set hive.security.metastore.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider;
#set hive.server2.enable.doAs=false;
#set hive.server2.tez.default.queues=default;
#set hive.server2.tez.initialize.default.sessions=false;
#set hive.server2.tez.sessions.per.default.queue=1;
set hive.stats.autogather=true;
set hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
#set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager;
#set hive.txn.max.open.batch=1000;
#set hive.txn.timeout=300;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.groupby.checkinterval=1024;
set hive.vectorized.groupby.flush.percent=1;
set hive.vectorized.groupby.maxentries=1024;

-- These values need to be tuned appropriately to your cluster. These examples are for reference.
-- set hive.tez.container.size=4096;
-- set hive.tez.java.opts=-Xmx3800m;
-- set hive.auto.convert.join.noconditionaltask.size=1252698795;

部分测试参考结果：

此结果是将结果数据导出，放到Excel中进行生成的（仅参考）。