1 TPC-H介绍
TPC-H(商业智能计算测试)是TPC的重要测试标准之一,主要用来模拟真实商业的应用环境。
TPC-H 用 3NF 实现了一个数据仓库,共包含 8 个基本关系/表,其中表REGION和表NATION的记录数是固定的(分别为5和25),其它6个表的记录数,则随所设定的参数SF而有所不同,其数据量可以设定从 1GB~3TB 不等。有8个级别供用户选择。
2 TPC-H工具下载
http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp
如果链接不可用,请到官网找到TPC-H下载:http://www.tpc.org
下载2.17.1
3 生产测试数据
在linux系统中解压
unzip tpc-h-tool-2.17.1.zip
tpch_2_17_0/dbgen/
进入dbgen目录后,
复制 makefile.suite文件(此步骤的目的是为了备份原有的makefile.suite)
编辑makefile
cp makefile.suite makefile
vim makefile
修改以下四处地方
执行
make -f makefile
生产数据:
./dbgen -s 1
-s 1 表示生成1G的数据(如果你之前曾经尝试过生成数据,最好先make clean,再重新make,接着到这步加上-f覆盖掉)
生成之后可以用head命令检查一下tbl们,会看到每一行都有一些用“|”隔开的字段.
./dbgen -h 帮助命令
执行命令后会生成8个数据文件
4 复制数据到hdfs
hdfs dfs -mkdir /testdata
hdfs dfs -mkdir /testdata/region
hdfs dfs -mkdir /testdata/supplier
hdfs dfs -mkdir /testdata/part
hdfs dfs -mkdir /testdata/customer
hdfs dfs -mkdir /testdata/lineitem
hdfs dfs -mkdir /testdata/nation
hdfs dfs -mkdir /testdata/orders
hdfs dfs -mkdir /testdata/partsupp
hdfs dfs -put /tmp/bigdata/region.tbl /testdata/region
hdfs dfs -put /tmp/bigdata/supplier.tbl /testdata/supplier
hdfs dfs -put /tmp/bigdata/part.tbl /testdata/part
hdfs dfs -put /tmp/bigdata/customer.tbl /testdata/customer
hdfs dfs -put /tmp/bigdata/lineitem.tbl /testdata/lineitem
hdfs dfs -put /tmp/bigdata/nation.tbl /testdata/nation
hdfs dfs -put /tmp/bigdata/orders.tbl /testdata/orders
hdfs dfs -put /tmp/bigdata/partsupp.tbl /testdata/partsupp
将那8个文件复制到对应的hdfs目录下面
5 创建impala外部关联表
CREATE EXTERNAL TABLE IF NOT EXISTS region
(
R_REGIONKEY INT,
R_NAME STRING,
R_COMMENT STRING
) COMMENT 'The file of TPCH is region.tbl'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/testdata/region';
CREATE EXTERNAL TABLE IF NOT EXISTS nation
(
N_NATIONKEY INT,
N_NAME STRING,
N_REGIONKEY INT,
N_COMMENT STRING
) COMMENT 'The file of TPCH is nation.tbl'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/testdata/nation';
CREATE EXTERNAL TABLE IF NOT EXISTS part
(
P_PARTKEY INT,
P_NAME STRING,
P_MFGR STRING,
P_BRAND STRING,
P_TYPE STRING,
P_SIZE INT,
P_CONTAINER STRING,
P_RETAILPRICE DECIMAL(15,2),
P_COMMENT STRING
) COMMENT 'The file of TPCH is part.tbl'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/testdata/part';
CREATE EXTERNAL TABLE IF NOT EXISTS lineitem
(
L_ORDERKEY INT,
L_PARTKEY INT,
L_SUPPKEY INT,
L_LINENUMBER INT,
L_QUANTITY DECIMAL(15,2),
L_EXTENDEDPRICE DECIMAL(15,2),
L_DISCOUNT DECIMAL(15,2),
L_TAX DECIMAL(15,2),
L_RETURNFLAG STRING,
L_LINESTATUS STRING,
L_SHIPDATE timestamp,
L_COMMITDATE timestamp,
L_RECEIPTDATE timestamp,
L_SHIPINSTRUCT STRING,
L_SHIPMODE STRING,
L_COMMENT STRING
) COMMENT 'The file of TPCH is lineitem.tbl'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/testdata/lineitem';
CREATE EXTERNAL TABLE IF NOT EXISTS supplier (
S_SUPPKEY INT,
S_NAME STRING,
S_ADDRESS STRING,
S_NATIONKEY INTEGER,
S_PHONE STRING,
S_ACCTBAL DECIMAL(15,2),
S_COMMENT STRING
)COMMENT 'The file of TPCH is supplier.tbl'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/testdata/supplier';
CREATE EXTERNAL TABLE IF NOT EXISTS partsupp (
PS_PARTKEY INT,
PS_SUPPKEY INT,
PS_AVAILQTY INT,
PS_SUPPLYCOST DECIMAL(15,2),
PS_COMMENT STRING
)COMMENT 'The file of TPCH is partsupp.tbl'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/testdata/partsupp';
CREATE EXTERNAL TABLE IF NOT EXISTS customer (
C_CUSTKEY INT,
C_NAME STRING,
C_ADDRESS STRING,
C_NATIONKEY INT,
C_PHONE STRING,
C_ACCTBAL DECIMAL(15,2),
C_MKTSEGMENT STRING,
C_COMMENT STRING
)COMMENT 'The file of TPCH is customer.tbl'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/testdata/customer';
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS STRING,
O_TOTALPRICE DECIMAL(15,2),
O_ORDERDATE timestamp,
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING
)COMMENT 'The file of TPCH is orders.tbl'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/testdata/orders';
现在就可以使用sql查询这些表了。