hive-tpcds-sql99测试详解

sunxunyong

于 2024-07-02 08:14:38 发布

阅读量751

点赞数 17

文章标签： hive hadoop 数据仓库

本文链接：https://blog.csdn.net/sunxunyong/article/details/140116276

版权

一、SQL99是什么
（1）是操作所有关系型数据库的规则
（2）是第四代语言
（3）是一种结构化查询语言
（4）只需发出合法合理的命令，就有对应的结果显示

SQL92/【99】标准的四大分类 ：
（A）DML（数据操纵语言）：select,insert,update,delete
（B）DDL（数据定义语言）：create table,alter table，drop table，truncate table 。。。
（C）DCL（数据控制语言）：grant 权限 to scott，revoke 权限 from scott 。。。
（D）TCL（事务控制语言）：commit，rollback，rollback to savepoint 。。。

SQL的特点
（1）交互性强，非过程化
（2）数据库操纵能力强，只需发送命令，无需关注如何实现
（3）多表操作时，自动导航简单，例如：
	select emp.empno,emp.sal,dept.dname
	from emp,dept
	where emp.deptno = dept.deptno
（4）容易调试，错误提示，直接了当
（5）SQL强调结果

二、TPC-DS（TPC Benchmark DS）基准测试标准
TPC-DS是一套决策支持系统测试基准，主要针对零售行业。
提供99个SQL查询（SQL99或2003），分析数据量大，测试数据与实际商业数据高度相似，同时具有各种业务模型（分析报告型，数据挖掘型等等）。
国内目前相关的翻译文章较少。本文尝试对官网的TPC BENCHMARK DS Standard Specification（下称“原文”）进行翻译。
翻译主要参照的是2017年发布的2.6.0版本。
现在可以在 http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp 查询到最新的版本。

TPC-DS 2.0第一个基于SQL的大数据系统基准测试标准

TPC-DS 2.0现在已经是TPC的第二个大数据基准测试标准。
在2014年8月，TPC引入了TPCx-HS——业界第一个大数据系统基准测试标准，用来评估大规模系统拓扑结构和实现方法。TPCx-HS也是TPC的第一个“Express”级基准测试标准，可下载工具包获得。
展望未来，TPC已开始迈向第三个大数据基准测试标准，“Express”级标准——TPCx-BB，目前正在接受公开审查。TPC鼓励有兴趣的朋友发表自己的审查意见。除此以外，TPC鼓励任何对TPC在大数据和其他领域工作有兴趣的朋友访问TPC的会员页面，或提交原创未经发表的论文到即将召开的技术会议上(TPCTC)。

官网地址：http://www.tpc.org/tpcds/default.asp

三、HIVE TPC-DS Benchmark
http://sungsoo.github.io/2014/08/02/tpc-ds-benchmark.html

1、hive-testbench生成数据
	hive-testbench是一个数据生成器和一组查询，可以让您大规模地尝试Apache Hive。测试台允许您在大型数据集上体验基本Hive性能，并且可以轻松查看Hive调优参数和高级设置的影响。
	测试数据为TPC-DS基准测试的数据，官方文档：http://www.tpc.org/information/current_specifications.asp，
	
2.1 下载hive-testbench
	git clone https://github.com/hortonworks/hive-testbench
	hive-testbench这个项目是用于生成TPC-DS数据集并且将其导入到hive，在使用之前需要保证已经将hive、hadoop等命令加入到PATH中。
	
	根据hdp的版本进行下载。

export MAVEN_HOME=/data/data10/test/apache-maven-3.6.3
export PATH= $P A T H :$ MAVEN_HOME/bin

2.2 编译
	hive-testbench附带了基于TPC-DS和TPC-H基准的数据生成器和示例查询。可以选择使用这两个基准中的一个或两个进行实验。
	对于TPC-DS：./tpcds-build.sh下载、编译和打包TPC-DS数据生成器。
	对于TPC-H ：./tpch-build.sh下载、编译和打包TPC-H数据生成器。

	进入hive-testbench目录，执行./tpcds-build.sh，该命令会从TPC-DS下载源代码、并编译、初始化metaStore，为导入数据到hive做准备。
	
2.3 生成并导入数据
	tpcds-setup.sh和tpch-setup.sh分别为TPC-DS和TPC-H生成、加载数据。
	一般用法是tpcds-setup.sh <scale_factor> [directory]或tpch-setup.sh <scale_factor> [directory]
	
	(1)、首先，在导入之前，修改tpcds-setup.sh，找到以下行，根据具体的hive环境参数进行修改：
		HIVE="beeline -n hive -u 'jdbc:hive2://localhost:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2' "

jdbc:hive2://B@@@-12:2181,B@@@-22:2181,B@@@-32:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
(2)、导入ORC数据：
./tpcds-setup.sh 100 默认的文件格式就是ORC，所以不需要指定存储格式。
参数100表示生成100GB的数据，该程序首先会生成text格式的数据到临时目录，然后再将这些数据转换成orc或者parquet格式。

		定义环境变量：
		FORMAT：数据存储格式。
		BUCKET_DATA：BUCKET_DATA参数比较重要，默认是1，若该参数不为空则为13.
		DEBUG_SCRIPT：debug模式。
		如：
			BUCKET_DATA=Y DEBUG_SCRIPT=Y MAX_REDUCERS=默认是<scale_factor>
			FORMAT=rcfile BUCKET_DATA=Y DEBUG_SCRIPT=Y ./tpcds-setup.sh 30
			BUCKET_DATA=Y DEBUG_SCRIPT=Y ./tpcds-setup.sh 5
		

	A、举例：
		生成1 TB的TPC-DS数据  : ./tpcds-setup.sh 1000 格式默认为orc
		生成1 TB的TPC-H数据   : ./tpch-setup.sh 1000
		生成100 TB的TPC-DS数据: ./tpcds-setup.sh 100000

		生成30 TB text格式的TPC-DS数据  : FORMAT=textfile ./tpcds-setup 30000
		生成30 TB RCFile格式的TPC-DS数据: FORMAT=rcfile ./tpcds-setup 30000
		生成100GB Parquet格式的TPC-DS数据数据：FORMAT=parquet ./tpcds-setup.sh 100 --指定文件格式为Parquet
	
	A、该命令会：
		首先，在hdfs上创建数据目录：/tmp/tpcds-generate/<scale_factor>,创建text格式的数据文件。
		然后，先创建text类型的数据库tpcds_text_${SCALE}，并创建外部分区表：执行hive -i settings/load-flat.sql -f ddl-tpcds/text/alltables.sql --hivevar DB=tpcds_text_2 --hivevar LOCATION=/tmp/tpcds-generate/2/
			  ddl-tpcds/text/alltables.sql为24张表的建表语句。hdfs dfs -ls /tmp/tpcds-generate/80
		然后，从tpcds_text_${SCALE}数据库（外部表）导入到tpcds_bin_partitioned_orc_<scale_factor>数据库（内部表）中，转换为需要的格式。根据参数生成数据转换的脚本：load_${FORMAT}_${SCALE}.mk，如：load_ORC_2.mk文件。
			首先，生成维表的执行sql：将数据导入到tpcds_text_<scale_factor>数据库中的17个维度表中。sql为ddl-tpcds/bin_partitioned/<table>.sql。每张表load完成后，输出信息：Optimizing table <tableName> (17/24)。
			然后，生成事实表的执行sql：将数据文件load到tpcds_text_<scale_factor>数据库中的7个事实表中。每张表load完成后，输出信息：Optimizing table <tableName> (17/24)。
			最后，执行make命令来执行这个load脚本sql “load_ORC_2.mk”。将数据load到hive中。

		24张表的导入顺序:
			date_dim,time_dim,item,customer,customer_demographics,household_demographics,customer_address,store,promotion,warehouse,ship_mode,reason,income_band,call_center,web_page,catalog_page,web_site,
			store_sales,store_returns,web_sales,web_returns,catalog_sales,catalog_returns,inventory
		
		最后输出：Data loaded into database xxxx

	A、执行结果：
		执行完后，会发现在hive中多了一个tpcds_bin_partitioned_orc_<scale_factor>的数据库。
		这个数据库中一共24个表：7个事实表，17个维度表，每一个事实表和大部分的维度表组成雪花模型。
	

	A、常见问题：
		[1]、如果执行tpcds-setup.sh报错，请检查shell中beeline连接url中的用户（默认是hive,该用户需要有管理员权限）、zk的地址（默认是localhost:2181）。
		[2]、在导入5G数据时，第18张表store_sales会比较慢，最后报错导致后续的表均无法导入。比如2G的数据，该表有5760749条数据。
		  方法一、在ambari页面上，修改hive的配置参数：
				  hive.exec.max.created.files为1000000（默认是100000，在hive session里使用set不生效），
				  hive.exec.max.dynamic.partitions.pernode=100000; 默认是2000，在hive session里设置不生效。
				  在settings/load-partitioned.sql中配置为使用mr引擎：set hive.execution.engine=mr; 。
		  方法二、第18-24张表导入失败时，单独执行load_ORC_5.mk中的数据导入sql。
		  方法三、看日志查找具体的错误原因：yarn logs -applicationId application_1551109758046_0099
		
		[3]、生成100G数据时报错：
			Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.lang.OutOfMemoryError: unable to create new native thread
			解决办法：修改mapreduce2的mapred_user_nofile_limit=102400，mapred_user_nproc_limit=102400

	安装HDP时，设置的容器最大内存为4G，而Tez的容器大小为11G，将Tez的容器大小改为4G后
		
2.4 执行测试
	hive-testbench包含超过50个TPC-DS查询sql和所有TPC-H查询sql。

	(1)、可以使用hive客户端命令进行执行：
	## cd /home/ocdp/hive-testbench-hdp3/sample-queries-tpcds
	## hive
	 hive> use tpcds_bin_partitioned_orc_10;
	 hive> source query55.sql;


	(2)、还可以登录beline，然后执行!run query1.sql; 来执行相关的查询。
	  [1]、hive thrift server:
		beeline -n ocdp -u 'jdbc:hive2://@@@50.65:2181/tpcds_bin_partitioned_orc_10;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2?tez.queue.name=hive'
	  
	  [2]、HDP 2.6 llap：
		beeline -u 'jdbc:hive2://demo01.host:2181,demo02.host:2181,demo03.host:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive'
	
	  [3]、HDP3 llap：
		cd /home/ocdp/hive-testbench-hdp3/sample-queries-tpcds/
		beeline -n ocdp -u 'jdbc:hive2://@@@50.65:2181/tpcds_bin_partitioned_orc_10;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive'
		beeline -n ocdp -u 'jdbc:hive2://@@@@:10500/tpcds_bin_partitioned_orc_10' -f /home/ocdp/hive-testbench-hdp3/sample-queries-tpcds/query1.sql
	  
	  [4]、HDP3 spark2 thrift server
		/usr/hdp/3.1.0.0-78/spark2/bin/beeline -n ocdp -u "jdbc:hive2://@@@50.63:10016/tpcds_bin_partitioned_orc_10" -f query1.sql
		spark-sql 

		然后执行!run query1.sql

	(3)、常见问题
		1、hdp5执行最简单的select查询报错：
		  2019-02-27T09:37:09,308 WARN  [main]: server.HiveServer2 (HiveServer2.java:startHiveServer2(1100)) - Error starting HiveServer2 on attempt 1, will retry in 60000ms
		  java.lang.RuntimeException: Error initializing the query results cache
		  Caused by: java.lang.RuntimeException: The dir: /tmp/hive/_resultscache_ on HDFS should be writable. Current permissions are: rwxr-xr-x
			at org.apache.hadoop.hive.ql.exec.Utilities.ensurePathIsWritable(Utilities.java:4512) ~[hive-exec-3.1.0.3.1.0.0-78.jar:3.1.0.3.1.0.0-78]
		
		【解决办法】：hdfs dfs -chmod -R 777 /tmp/hive/_resultscache_，然后重启llap。
		可能是由于hive没启用kerberos，允许匿名访问导致的。在启用kerberos时，未发现该问题。
		
		2、spark2 thrift server查看hive库，不报错，没有查询结果。select count均为0，
		后台日志报：
		WARN HiveMetastoreCatalog: Unable to infer schema for table tpcds_bin_partitioned_orc_10.web_site from file format ORC (inference mode: INFER_AND_SAVE). Using metastore schema.

19/02/27 17:04:43 WARN HiveMetastoreCatalog: Unable to infer schema for table tpcds_bin_partitioned_orc_10.web_site from file format ORC (inference mode: INFER_AND_SAVE). Using metastore schema.
19/02/27 17:04:43 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting ‘spark.debug.maxToStringFields’ in SparkEnv.conf.

use tpcds_bin_partitioned_orc_10;
drop table date_dim2;
create table date_dim2
stored as orc
as select * from tpcds_text_10.date_dim;
在hive创建并导入数据的表，在spark中能看到表，但无法查看数据。
在spark创建并导入数据的表，在spark中能看到表，也能查询到数据，但是在hive中使用tez做引擎count(*)报错,select * limit 10可以，使用mr引擎可以count，可以select * limit 10。

sunxunyong

关注

17
点赞
踩
14

收藏

觉得还不错? 一键收藏
0
评论
hive-tpcds-sql99测试详解

在spark创建并导入数据的表，在spark中能看到表，也能查询到数据，但是在hive中使用tez做引擎count(*)报错,select * limit 10可以，使用mr引擎可以count，可以select * limit 10。提供99个SQL查询（SQL99或2003），分析数据量大，测试数据与实际商业数据高度相似，同时具有各种业务模型（分析报告型，数据挖掘型等等）。在hive创建并导入数据的表，在spark中能看到表，但无法查看数据。（4）只需发出合法合理的命令，就有对应的结果显示。
复制链接

扫一扫