TPC-H Queries and Optimization on Spark
1. Objective
We ran the 22 TPC-H queries and tried several optimizations, including shuffle tuning, persistence, running Spark on multiple cores, and giving the virtual machine multiple cores, then analyzed the results.
About the data: TPC-H is a toolkit from the TPC (Transaction Processing Performance Council) for OLAP testing, used to evaluate the performance of decision support systems (DSS) in business analytics. It contains a suite of business-oriented ad-hoc queries and concurrent data modifications and serves as a benchmark for database query performance. The downloaded tpc-h.zip contains eight .tbl files: nation.tbl, part.tbl, region.tbl, customer.tbl, supplier.tbl, partsupp.tbl, orders.tbl, and lineitem.tbl.
This lab runs the TPC-H test on this dataset and, on top of it, tries different query-optimization techniques.
2. Environment
2.1 Host & VM Configuration
Host: Windows 10 x64.
VM: Ubuntu 64-bit (pseudo-distributed).
2.2 Java & Spark Setup
3. Uploading Data & Creating Tables
- Nation

```scala
case class Nation(N_NATIONKEY: Integer, N_NAME: String, N_REGIONKEY: Integer, N_COMMENT: String)

val data = spark.sparkContext
  .textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/nation.tbl")
  .map(_.split('|'))
  .map(p => Nation(p(0).toInt, p(1), p(2).toInt, p(3)))
  .toDF()
data.show()
data.createOrReplaceTempView("nation")
```
- Part

```scala
case class Part(P_PARTKEY: Integer, P_NAME: String, P_MFGR: String, P_BRAND: String, P_TYPE: String,
                P_SIZE: Integer, P_CONTAINER: String, P_RETAILPRICE: Double, P_COMMENT: String)

val data = spark.sparkContext
  .textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/part.tbl")
  .map(_.split('|'))
  .map(p => Part(p(0).toInt, p(1), p(2), p(3), p(4), p(5).toInt, p(6), p(7).toDouble, p(8)))
  .toDF()
data.show()
data.createOrReplaceTempView("part")
```
- Supplier

```scala
case class Supplier(S_SUPPKEY: Integer, S_NAME: String, S_ADDRESS: String, S_NATIONKEY: Integer,
                    S_PHONE: String, S_ACCTBAL: Double, S_COMMENT: String)

val data = spark.sparkContext
  .textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/supplier.tbl")
  .map(_.split('|'))
  .map(p => Supplier(p(0).toInt, p(1), p(2), p(3).toInt, p(4), p(5).toDouble, p(6)))
  .toDF()
data.show()
data.createOrReplaceTempView("supplier")
```
- Partsupp

```scala
case class Partsupp(PS_PARTKEY: Integer, PS_SUPPKEY: Integer, PS_AVAILQTY: Integer,
                    PS_SUPPLYCOST: Double, PS_COMMENT: String)

val data = spark.sparkContext
  .textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/partsupp.tbl")
  .map(_.split('|'))
  .map(p => Partsupp(p(0).toInt, p(1).toInt, p(2).toInt, p(3).toDouble, p(4)))
  .toDF()
data.show()
data.createOrReplaceTempView("partsupp")
```
- Customer

```scala
case class Customer(C_CUSTKEY: Integer, C_NAME: String, C_ADDRESS: String, C_NATIONKEY: Integer,
                    C_PHONE: String, C_ACCTBAL: Double, C_MKTSEGMENT: String, C_COMMENT: String)

val data = spark.sparkContext
  .textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/customer.tbl")
  .map(_.split('|'))
  .map(p => Customer(p(0).toInt, p(1), p(2), p(3).toInt, p(4), p(5).toDouble, p(6), p(7)))
  .toDF()
data.show()
data.createOrReplaceTempView("customer")
```
- Orders

```scala
case class Orders(O_ORDERKEY: Integer, O_CUSTKEY: Integer, O_ORDERSTATUS: String, O_TOTALPRICE: Double,
                  O_ORDERDATE: String, O_ORDERPRIORITY: String, O_CLERK: String,
                  O_SHIPPRIORITY: Integer, O_COMMENT: String)

val data = spark.sparkContext
  .textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/orders.tbl")
  .map(_.split('|'))
  .map(p => Orders(p(0).toInt, p(1).toInt, p(2), p(3).toDouble, p(4), p(5), p(6), p(7).toInt, p(8)))
  .toDF()
data.show()
data.createOrReplaceTempView("orders")
```
- LineItem

```scala
case class LineItem(L_ORDERKEY: Integer, L_PARTKEY: Integer, L_SUPPKEY: Integer, L_LINENUMBER: Integer,
                    L_QUANTITY: Double, L_EXTENDEDPRICE: Double, L_DISCOUNT: Double, L_TAX: Double,
                    L_RETURNFLAG: String, L_LINESTATUS: String, L_SHIPDATE: String, L_COMMITDATE: String,
                    L_RECEIPTDATE: String, L_SHIPINSTRUCT: String, L_SHIPMODE: String, L_COMMENT: String)

val data = spark.sparkContext
  .textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/lineitem.tbl")
  .map(_.split('|'))
  .map(p => LineItem(p(0).toInt, p(1).toInt, p(2).toInt, p(3).toInt, p(4).toDouble, p(5).toDouble,
                     p(6).toDouble, p(7).toDouble, p(8), p(9), p(10), p(11), p(12), p(13), p(14), p(15)))
  .toDF()
data.show()
// Registered as "lineitem" (lowercase) to match the spelling used in the queries below;
// temp-view lookups are case-insensitive, but consistent naming avoids confusion.
data.createOrReplaceTempView("lineitem")
```
- Region

```scala
case class Region(R_REGIONKEY: Integer, R_NAME: String, R_COMMENT: String)

val data = spark.sparkContext
  .textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/region.tbl")
  .map(_.split('|'))
  .map(p => Region(p(0).toInt, p(1), p(2)))
  .toDF()
data.show()
data.createOrReplaceTempView("region")
```
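The eight loaders above all follow the same load/split/parse/register pattern. As a sketch (this helper is not part of the original lab code; it assumes the same spark-shell session and data directory), the repetition could be factored out:

```scala
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: load one '|'-delimited .tbl file and register it as a temp view.
// `parse` turns a split row into the table's case class.
def loadTable[T <: Product : TypeTag](spark: SparkSession, table: String)
                                     (parse: Array[String] => T): DataFrame = {
  import spark.implicits._
  val df = spark.sparkContext
    .textFile(s"file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/$table.tbl")
    .map(_.split('|'))
    .map(parse)
    .toDF()
  df.createOrReplaceTempView(table)
  df
}

// Example:
// loadTable(spark, "region")(p => Region(p(0).toInt, p(1), p(2)))
```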
4. The 22 TPC-H Queries
4.1 Query Statements
- sql_1

```scala
val sql_1 = spark.sql("""
  select l_returnflag, l_linestatus,
         sum(l_quantity) as sum_qty,
         sum(l_extendedprice) as sum_base_price,
         sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
         sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
         avg(l_quantity) as avg_qty,
         avg(l_extendedprice) as avg_price,
         avg(l_discount) as avg_disc,
         count(*) as count_order
  from lineitem
  where l_shipdate <= date '1998-12-01' - interval '84' day
  group by l_returnflag, l_linestatus
  order by l_returnflag, l_linestatus
  limit 1""")
sql_1.show()
```
- sql_2

```scala
val sql_2 = spark.sql("""
  select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
  from part, supplier, partsupp, nation, region
  where p_partkey = ps_partkey
    and s_suppkey = ps_suppkey
    and p_size = 16
    and p_type like '%STEEL'
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'AFRICA'
    and ps_supplycost = (
      select min(ps_supplycost)
      from partsupp, supplier, nation, region
      where p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'AFRICA')
  order by s_acctbal desc, n_name, s_name, p_partkey
  limit 100""")
sql_2.show()
```
- sql_3

```scala
val sql_3 = spark.sql("""
  select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue, o_orderdate, o_shippriority
  from customer, orders, lineitem
  where c_mktsegment = 'MACHINERY'
    and c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate < date '1995-03-06'
    and l_shipdate > date '1995-03-06'
  group by l_orderkey, o_orderdate, o_shippriority
  order by revenue desc, o_orderdate
  limit 10""")
sql_3.show()
```
- sql_4

```scala
val sql_4 = spark.sql("""
  select o_orderpriority, count(*) as order_count
  from orders
  where o_orderdate >= date '1993-10-01'
    and o_orderdate < date '1993-10-01' + interval '3' month
    and exists (
      select * from lineitem
      where l_orderkey = o_orderkey and l_commitdate < l_receiptdate)
  group by o_orderpriority
  order by o_orderpriority
  limit 1""")
sql_4.show()
```
- sql_5

```scala
val sql_5 = spark.sql("""
  select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
  from customer, orders, lineitem, supplier, nation, region
  where c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and l_suppkey = s_suppkey
    and c_nationkey = s_nationkey
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'AMERICA'
    and o_orderdate >= date '1996-01-01'
    and o_orderdate < date '1996-01-01' + interval '1' year
  group by n_name
  order by revenue desc
  limit 1""")
sql_5.show()
```
- sql_6

```scala
val sql_6 = spark.sql("""
  select sum(l_extendedprice * l_discount) as revenue
  from lineitem
  where l_shipdate >= date '1996-01-01'
    and l_shipdate < date '1996-01-01' + interval '1' year
    and l_discount between 0.05 - 0.01 and 0.05 + 0.01
    and l_quantity < 25
  limit 1""")
sql_6.show()
```
- sql_7

```scala
val sql_7 = spark.sql("""
  select supp_nation, cust_nation, l_year, sum(volume) as revenue
  from (
    select n1.n_name as supp_nation, n2.n_name as cust_nation,
           extract(year from l_shipdate) as l_year,
           l_extendedprice * (1 - l_discount) as volume
    from supplier, lineitem, orders, customer, nation n1, nation n2
    where s_suppkey = l_suppkey
      and o_orderkey = l_orderkey
      and c_custkey = o_custkey
      and s_nationkey = n1.n_nationkey
      and c_nationkey = n2.n_nationkey
      and ((n1.n_name = 'IRAN' and n2.n_name = 'ROMANIA')
        or (n1.n_name = 'ROMANIA' and n2.n_name = 'IRAN'))
      and l_shipdate between date '1995-01-01' and date '1996-12-31'
  ) as shipping
  group by supp_nation, cust_nation, l_year
  order by supp_nation, cust_nation, l_year
  limit 1""")
sql_7.show()
```
- sql_8

```scala
val sql_8 = spark.sql("""
  select o_year,
         sum(case when nation = 'ROMANIA' then volume else 0 end) / sum(volume) as mkt_share
  from (
    select extract(year from o_orderdate) as o_year,
           l_extendedprice * (1 - l_discount) as volume,
           n2.n_name as nation
    from part, supplier, lineitem, orders, customer, nation n1, nation n2, region
    where p_partkey = l_partkey
      and s_suppkey = l_suppkey
      and l_orderkey = o_orderkey
      and o_custkey = c_custkey
      and c_nationkey = n1.n_nationkey
      and n1.n_regionkey = r_regionkey
      and r_name = 'EUROPE'
      and s_nationkey = n2.n_nationkey
      and o_orderdate between date '1995-01-01' and date '1996-12-31'
      and p_type = 'STANDARD ANODIZED COPPER'
  ) as all_nations
  group by o_year
  order by o_year
  limit 1""")
sql_8.show()
```
- sql_9

```scala
val sql_9 = spark.sql("""
  select nation, o_year, sum(amount) as sum_profit
  from (
    select n_name as nation,
           extract(year from o_orderdate) as o_year,
           l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
    from part, supplier, lineitem, partsupp, orders, nation
    where s_suppkey = l_suppkey
      and ps_suppkey = l_suppkey
      and ps_partkey = l_partkey
      and p_partkey = l_partkey
      and o_orderkey = l_orderkey
      and s_nationkey = n_nationkey
      and p_name like '%aquamarine%'
  ) as profit
  group by nation, o_year
  order by nation, o_year desc
  limit 1""")
sql_9.show()
```
- sql_10

```scala
val sql_10 = spark.sql("""
  select c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue,
         c_acctbal, n_name, c_address, c_phone, c_comment
  from customer, orders, lineitem, nation
  where c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate >= date '1994-03-01'
    and o_orderdate < date '1994-03-01' + interval '3' month
    and l_returnflag = 'R'
    and c_nationkey = n_nationkey
  group by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
  order by revenue desc
  limit 20""")
sql_10.show()
```
- sql_11

```scala
val sql_11 = spark.sql("""
  select ps_partkey, sum(ps_supplycost * ps_availqty) as value
  from partsupp, supplier, nation
  where ps_suppkey = s_suppkey
    and s_nationkey = n_nationkey
    and n_name = 'PERU'
  group by ps_partkey
  having sum(ps_supplycost * ps_availqty) > (
    select sum(ps_supplycost * ps_availqty) * 0.0001000000
    from partsupp, supplier, nation
    where ps_suppkey = s_suppkey
      and s_nationkey = n_nationkey
      and n_name = 'PERU')
  order by value desc
  limit 1""")
sql_11.show()
```
- sql_12

```scala
val sql_12 = spark.sql("""
  select l_shipmode,
         sum(case when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH'
                  then 1 else 0 end) as high_line_count,
         sum(case when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH'
                  then 1 else 0 end) as low_line_count
  from orders, lineitem
  where o_orderkey = l_orderkey
    and l_shipmode in ('RAIL', 'SHIP')
    and l_commitdate < l_receiptdate
    and l_shipdate < l_commitdate
    and l_receiptdate >= date '1993-01-01'
    and l_receiptdate < date '1993-01-01' + interval '1' year
  group by l_shipmode
  order by l_shipmode
  limit 1""")
sql_12.show()
```
- sql_13

```scala
val sql_13 = spark.sql("""
  select c_count, count(*) as custdist
  from (
    select c_custkey, count(o_orderkey) as c_count
    from customer left outer join orders
      on c_custkey = o_custkey and o_comment not like '%pending%deposits%'
    group by c_custkey
  ) as c_orders
  group by c_count
  order by custdist desc, c_count desc
  limit 1""")
sql_13.show()
```
- sql_14

```scala
val sql_14 = spark.sql("""
  select 100.00 * sum(case when p_type like 'PROMO%'
                           then l_extendedprice * (1 - l_discount)
                           else 0 end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
  from lineitem, part
  where l_partkey = p_partkey
    and l_shipdate >= date '1997-12-01'
    and l_shipdate < date '1997-12-01' + interval '1' month
  limit 1""")
sql_14.show()
```
- sql_15

The TPC-H template for this query defines a view with create view; here we instead register "revenue0" as a temporary view for the main query to use:

```scala
val temp1 = spark.sql("""
  select l_suppkey as supplier_no,
         sum(l_extendedprice * (1 - l_discount)) as total_revenue
  from lineitem
  where l_shipdate >= date '1993-03-01'
    and l_shipdate < date '1993-03-01' + interval '3' month
  group by l_suppkey""")
temp1.createOrReplaceTempView("revenue0")

val sql_15 = spark.sql("""
  select s_suppkey, s_name, s_address, s_phone, total_revenue
  from supplier, revenue0
  where s_suppkey = supplier_no
    and total_revenue = (select max(total_revenue) from revenue0)
  order by s_suppkey""")
sql_15.show()
```
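The same registration can also be done purely on the SQL side: Spark SQL supports temporary-view DDL, so an equivalent alternative to the Scala-side createOrReplaceTempView call above would be (a sketch; either form works):

```scala
spark.sql("""
  create or replace temporary view revenue0 as
  select l_suppkey as supplier_no,
         sum(l_extendedprice * (1 - l_discount)) as total_revenue
  from lineitem
  where l_shipdate >= date '1993-03-01'
    and l_shipdate < date '1993-03-01' + interval '3' month
  group by l_suppkey""")
```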
- sql_16

```scala
val sql_16 = spark.sql("""
  select p_brand, p_type, p_size, count(distinct ps_suppkey) as supplier_cnt
  from partsupp, part
  where p_partkey = ps_partkey
    and p_brand <> 'Brand#52'
    and p_type not like 'LARGE ANODIZED%'
    and p_size in (42, 38, 15, 48, 33, 3, 27, 45)
    and ps_suppkey not in (
      select s_suppkey from supplier
      where s_comment like '%Customer%Complaints%')
  group by p_brand, p_type, p_size
  order by supplier_cnt desc, p_brand, p_type, p_size
  limit 1""")
sql_16.show()
```
- sql_17

```scala
val sql_17 = spark.sql("""
  select sum(l_extendedprice) / 7.0 as avg_yearly
  from lineitem, part
  where p_partkey = l_partkey
    and p_brand = 'Brand#23'
    and p_container = 'WRAP CASE'
    and l_quantity < (
      select 0.2 * avg(l_quantity)
      from lineitem
      where l_partkey = p_partkey)
  limit 1""")
sql_17.show()
```
- sql_18

```scala
val sql_18 = spark.sql("""
  select c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)
  from customer, orders, lineitem
  where o_orderkey in (
      select l_orderkey from lineitem
      group by l_orderkey
      having sum(l_quantity) > 314)
    and c_custkey = o_custkey
    and o_orderkey = l_orderkey
  group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
  order by o_totalprice desc, o_orderdate
  limit 100""")
sql_18.show()
```
- sql_19

```scala
val sql_19 = spark.sql("""
  select sum(l_extendedprice * (1 - l_discount)) as revenue
  from lineitem, part
  where (p_partkey = l_partkey
     and p_brand = 'Brand#54'
     and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
     and l_quantity >= 5 and l_quantity <= 5 + 10
     and p_size between 1 and 5
     and l_shipmode in ('AIR', 'AIR REG')
     and l_shipinstruct = 'DELIVER IN PERSON')
     or (p_partkey = l_partkey
     and p_brand = 'Brand#25'
     and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
     and l_quantity >= 19 and l_quantity <= 19 + 10
     and p_size between 1 and 10
     and l_shipmode in ('AIR', 'AIR REG')
     and l_shipinstruct = 'DELIVER IN PERSON')
     or (p_partkey = l_partkey
     and p_brand = 'Brand#42'
     and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
     and l_quantity >= 24 and l_quantity <= 24 + 10
     and p_size between 1 and 15
     and l_shipmode in ('AIR', 'AIR REG')
     and l_shipinstruct = 'DELIVER IN PERSON')
  limit 1""")
sql_19.show()
```
- sql_20

```scala
val sql_20 = spark.sql("""
  select s_name, s_address
  from supplier, nation
  where s_suppkey in (
      select ps_suppkey
      from partsupp
      where ps_partkey in (
          select p_partkey from part where p_name like 'purple%')
        and ps_availqty > (
          select 0.5 * sum(l_quantity)
          from lineitem
          where l_partkey = ps_partkey
            and l_suppkey = ps_suppkey
            and l_shipdate >= date '1994-01-01'
            and l_shipdate < date '1994-01-01' + interval '1' year))
    and s_nationkey = n_nationkey
    and n_name = 'UNITED KINGDOM'
  order by s_name
  limit 1""")
sql_20.show()
```
- sql_21

```scala
val sql_21 = spark.sql("""
  select s_name, count(*) as numwait
  from supplier, lineitem l1, orders, nation
  where s_suppkey = l1.l_suppkey
    and o_orderkey = l1.l_orderkey
    and o_orderstatus = 'F'
    and l1.l_receiptdate > l1.l_commitdate
    and exists (
      select * from lineitem l2
      where l2.l_orderkey = l1.l_orderkey
        and l2.l_suppkey <> l1.l_suppkey)
    and not exists (
      select * from lineitem l3
      where l3.l_orderkey = l1.l_orderkey
        and l3.l_suppkey <> l1.l_suppkey
        and l3.l_receiptdate > l3.l_commitdate)
    and s_nationkey = n_nationkey
    and n_name = 'FRANCE'
  group by s_name
  order by numwait desc, s_name
  limit 100""")
sql_21.show()
```
- sql_22

```scala
val sql_22 = spark.sql("""
  select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbal
  from (
    select substring(c_phone from 1 for 2) as cntrycode, c_acctbal
    from customer
    where substring(c_phone from 1 for 2) in ('33', '25', '16', '23', '32', '13', '19')
      and c_acctbal > (
        select avg(c_acctbal)
        from customer
        where c_acctbal > 0.00
          and substring(c_phone from 1 for 2) in ('33', '25', '16', '23', '32', '13', '19'))
      and not exists (
        select * from orders where o_custkey = c_custkey)
  ) as custsale
  group by cntrycode
  order by cntrycode
  limit 1""")
sql_22.show()
```
4.2 Runtime Analysis
- Open localhost:4040/ in Firefox to see the monitoring data recorded while the 22 TPC-H queries ran. Screenshots of the monitoring data:
- Click SQL at the top of the page to see the time taken by each SQL statement.
The execution times of the 22 queries were:
Query | Execution time |
---|---|
sql_1 | 1.1min |
sql_2 | 2.0min |
sql_3 | 1.5min |
sql_4 | 1.7min |
sql_5 | 2.0min |
sql_6 | 40s |
sql_7 | 2.4min |
sql_8 | 2.2min |
sql_9 | 2.4min |
sql_10 | 1.4min |
sql_11 | 45s |
sql_12 | 60s |
sql_13 | 32s |
sql_14 | 57s |
sql_15 | 1.8min |
sql_16 | 57s |
sql_17 | 1.7min |
sql_18 | 2.0min |
sql_19 | 48s |
sql_20 | 1.7min |
sql_21 | 3.2min |
sql_22 | 30s |
5. Optimization
5.1 Data Skew
Judging by the per-query execution times, the differences between statements are modest: all finish in roughly 1 to 3 minutes. Skew-specific tuning therefore promised little benefit, and none was applied.
5.2 Shuffle
The shuffle write phase runs at the end of a stage. So that the next stage can execute a shuffle operator (such as reduceByKey), each task partitions the data it processed by key: identical keys are hashed into the same disk file, and each disk file belongs to exactly one task of the downstream stage. Records are first written to an in-memory buffer and spilled to the disk file only once the buffer fills. Without optimization, the number of files produced by shuffle write can be enormous.
The shuffle read phase happens at the start of the next stage. Each of its tasks pulls all occurrences of its keys from the previous stage's outputs across the nodes over the network, then aggregates or joins them by key. Because shuffle write created one file per downstream task, each shuffle read task only needs to fetch the one file that belongs to it from the nodes where the upstream tasks ran.
Concretely, our shuffle optimization sets the consolidateFiles parameter of the HashShuffleManager. When it is true, the consolidate mechanism merges shuffle write output files aggressively; when there are many shuffle read tasks, this can greatly reduce disk I/O and improve performance. (Note that this flag belongs to the legacy HashShuffleManager and was removed together with it in Spark 2.0, so on the Spark 3.0.0 used here it is effectively ignored; warm caches may account for part of the speedup observed below.) The code:
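As a minimal sketch of such a shuffle boundary (assuming the same spark-shell session), reduceByKey is exactly the kind of operator described above:

```scala
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// Stage boundary: upstream tasks write shuffle files bucketed by key (shuffle write),
// downstream tasks fetch their bucket over the network and aggregate it (shuffle read).
val summed = pairs.reduceByKey(_ + _)
summed.collect()  // yields (a,4) and (b,2), in some order
```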
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
conf.set("spark.shuffle.consolidateFiles", "true")
```
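On Spark 3.x, where sort-based shuffle is the only implementation and the consolidate flag is ignored, the closest tuning knob for these SQL queries is the shuffle partition count (a sketch; 48 is an arbitrary example for a single machine, not a measured optimum):

```scala
// The default of 200 shuffle partitions is high for one small machine;
// fewer partitions mean fewer tasks and fewer shuffle files.
spark.conf.set("spark.sql.shuffle.partitions", "48")
```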
Before setting the shuffle parameter:
And after:
The total runtimes are 33 min versus 12 min, a clear improvement.
The per-query results show the same picture:
Query | Original time | Optimized time | Speedup |
---|---|---|---|
sql_1 | 1.1min | 26s | 2.54 |
sql_2 | 2.0min | 55s | 2.18 |
sql_3 | 1.5min | 36s | 2.50 |
sql_4 | 1.7min | 26s | 3.92 |
sql_5 | 2.0min | 48s | 2.50 |
sql_6 | 40s | 11s | 3.64 |
sql_7 | 2.4min | 50s | 2.88 |
sql_8 | 2.2min | 52s | 2.54 |
sql_9 | 2.4min | 54s | 2.67 |
sql_10 | 1.4min | 33s | 2.55 |
sql_11 | 45s | 19s | 2.37 |
sql_12 | 60s | 21s | 2.86 |
sql_13 | 32s | 12s | 2.67 |
sql_14 | 57s | 17s | 3.35 |
sql_15 | 1.8min | 30s | 3.60 |
sql_16 | 57s | 23s | 2.48 |
sql_17 | 1.7min | 37s | 2.76 |
sql_18 | 2.0min | 37s | 3.24 |
sql_19 | 48s | 16s | 3.00 |
sql_20 | 1.7min | 33s | 3.09 |
sql_21 | 3.2min | 1.2min | 2.67 |
sql_22 | 30s | 13s | 2.31 |
On average the queries ran 2.8327 times faster, a substantial improvement.
5.3 Persistence
One of Spark's key capabilities is persisting (caching) a dataset in memory across operations. When an RDD is persisted, each node keeps in memory the partitions of it that it computes and reuses them in later actions on that dataset, which typically makes those actions much faster (often by 10x or more). Caching is a key tool for iterative algorithms and fast interactive use.
An RDD can be persisted with persist() or cache(). The data is computed at the first action and cached in the nodes' memory. Spark's cache is fault-tolerant: if any partition of a cached RDD is lost, Spark automatically recomputes it from its original lineage and caches it again.
Code change: call .cache() on each table as it is created, so that it is kept in memory. Taking supplier as an example:
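Besides the default level used by cache() (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames), persist() accepts an explicit storage level. A sketch using one of the views registered earlier:

```scala
import org.apache.spark.storage.StorageLevel

// Spill partitions to local disk rather than dropping them when memory is tight.
val lineitemDF = spark.table("lineitem").persist(StorageLevel.MEMORY_AND_DISK)
lineitemDF.count()  // the first action materializes the cache
```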
```scala
val data = spark.sparkContext
  .textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/supplier.tbl")
  .map(_.split('|'))
  .map(p => Supplier(p(0).toInt, p(1), p(2), p(3).toInt, p(4), p(5).toDouble, p(6)))
  .toDF()
  .cache()
```
The resulting runtimes:
Query | Original time | Optimized time | Speedup |
---|---|---|---|
sql_1 | 1.1min | 1.1min | 1.00 |
sql_2 | 2.0min | 25s | 4.80 |
sql_3 | 1.5min | 24s | 3.75 |
sql_4 | 1.7min | 14s | 7.29 |
sql_5 | 2.0min | 19s | 6.31 |
sql_6 | 40s | 5s | 8.00 |
sql_7 | 2.4min | 16s | 9.00 |
sql_8 | 2.2min | 13s | 10.15 |
sql_9 | 2.4min | 37s | 3.89 |
sql_10 | 1.4min | 26s | 3.23 |
sql_11 | 45s | 6s | 7.50 |
sql_12 | 60s | 13s | 4.61 |
sql_13 | 32s | 12s | 2.67 |
sql_14 | 57s | 6s | 9.50 |
sql_15 | 1.8min | 33s | 3.27 |
sql_16 | 57s | 32s | 1.78 |
sql_17 | 1.7min | 18s | 5.67 |
sql_18 | 2.0min | 18s | 6.67 |
sql_19 | 48s | 9s | 5.33 |
sql_20 | 1.7min | 16s | 6.38 |
sql_21 | 3.2min | 24s | 8.00 |
sql_22 | 30s | 12s | 2.50 |
The comparison shows that persistence helps markedly: per-query execution times drop from the original 1 to 3 minutes to roughly 10 to 30 seconds, almost all under one minute and often under half a minute. The average per-query speedup works out to 5.5136x, so persistence is clearly very effective.
5.4 Running Spark on Multiple Cores
The number of cores can be set when launching Spark:

```shell
./bin/spark-shell --master local[12]
```

The two screenshots below show the runtime parameters for the single-core (cores=1) and multi-core (cores=12) runs:
The per-query execution times:
Query | Original time | Optimized time | Speedup |
---|---|---|---|
sql_1 | 1.1min | 28s | 2.36 |
sql_2 | 2.0min | 54s | 2.22 |
sql_3 | 1.5min | 42s | 2.14 |
sql_4 | 1.7min | 34s | 3.00 |
sql_5 | 2.0min | 1.0min | 2.00 |
sql_6 | 40s | 14s | 2.86 |
sql_7 | 2.4min | 1.1min | 2.18 |
sql_8 | 2.2min | 53s | 2.49 |
sql_9 | 2.4min | 1.1min | 2.18 |
sql_10 | 1.4min | 37s | 2.27 |
sql_11 | 45s | 20s | 2.25 |
sql_12 | 60s | 24s | 2.50 |
sql_13 | 32s | 13s | 2.46 |
sql_14 | 57s | 19s | 3.00 |
sql_15 | 1.8min | 36s | 3.00 |
sql_16 | 57s | 26s | 2.19 |
sql_17 | 1.7min | 40s | 2.55 |
sql_18 | 2.0min | 47s | 2.55 |
sql_19 | 48s | 19s | 2.52 |
sql_20 | 1.7min | 43s | 2.37 |
sql_21 | 3.2min | 1.7min | 1.88 |
sql_22 | 30s | 16s | 1.88 |
The average speedup is 2.40, i.e. queries ran 2.40 times faster than before; a clear improvement.
5.5 VM Configuration
Increase the VM's core count (processors) from 1 to 2 and observe the effect.
Per query:
Query | Original time | Optimized time | Speedup |
---|---|---|---|
sql_1 | 1.1min | 27s | 2.44 |
sql_2 | 2.0min | 1.1min | 1.82 |
sql_3 | 1.5min | 1.1min | 1.36 |
sql_4 | 1.7min | 26s | 3.92 |
sql_5 | 2.0min | 55s | 2.18 |
sql_6 | 40s | 13s | 3.08 |
sql_7 | 2.4min | 56s | 2.57 |
sql_8 | 2.2min | 48s | 2.75 |
sql_9 | 2.4min | 56s | 2.57 |
sql_10 | 1.4min | 32s | 2.63 |
sql_11 | 45s | 20s | 2.25 |
sql_12 | 60s | 21s | 2.86 |
sql_13 | 32s | 11s | 2.91 |
sql_14 | 57s | 19s | 3.00 |
sql_15 | 1.8min | 38s | 2.84 |
sql_16 | 57s | 23s | 2.48 |
sql_17 | 1.7min | 35s | 2.91 |
sql_18 | 2.0min | 42s | 2.86 |
sql_19 | 48s | 18s | 2.67 |
sql_20 | 1.7min | 39s | 2.61 |
sql_21 | 3.2min | 1.2min | 2.67 |
sql_22 | 30s | 13s | 2.31 |
After increasing the VM core count, the average speedup is about 2.6223, again substantial.
6. Results & Analysis
Across the tests and optimizations of the 22 queries, the techniques rank as follows by average speedup: persistence (5.5136x) > shuffle tuning (2.8327x) > multi-core VM (2.6223x) > multi-core Spark (2.40x), with persistence clearly ahead of the other three.
We also considered data partitioning and data skew mitigation, but this dataset is laid out fairly evenly and the queries' runtimes do not differ much; and since TPC-H dates are at day granularity over a huge number of rows, partitioning by time is impractical.
7. Takeaways
Through this lab we got to know the TPC-H benchmark. Running its 22 queries gave us hands-on experience with creating tables and querying in Spark, and trying the various optimizations on top of those queries deepened our understanding of Spark's data loading, storage, query execution, optimization techniques, and the principles behind them.