Big Data Management Technology | The 22 TPC-H Queries on Spark and Their Optimization

TPC-H Queries on Spark and Their Optimization

1. Experiment Objective

Run the 22 TPC-H queries as a query benchmark, then try several optimizations — shuffle tuning, persistence (caching), running Spark with multiple cores, and giving the virtual machine more cores — and analyze the results.

About the data: TPC-H is a toolkit published by the TPC (Transaction Processing Performance Council) for OLAP testing, used to evaluate the performance of decision-support systems (DSS) in business analytics. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications and is widely used as a benchmark of database query performance. The downloaded tpc-h.zip contains eight .tbl files: "nation.tbl", "part.tbl", "region.tbl", "customer.tbl", "supplier.tbl", "partsupp.tbl", "orders.tbl", and "lineitem.tbl".

This experiment runs the TPC-H test on this dataset and then tries different query-optimization techniques on top of it.

2. Experiment Environment

2.1 Host & VM Configuration

Host machine: Windows 10 x64.

Virtual machine: 64-bit Ubuntu (pseudo-distributed setup).

2.2 Java & Spark Setup

The experiments use Spark 3.0.0 (spark-3.0.0-bin-hadoop2.7) installed under /usr/local/spark on the Ubuntu VM.

3. Data Upload & Table Creation

  1. Nation

    case class Nation(N_NATIONKEY:Integer, N_NAME:String,N_REGIONKEY:Integer,N_COMMENT:String)
    
    val data = spark.sparkContext.textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/nation.tbl").map(_.split('|')).map(p => Nation(p(0).toInt, p(1).toString,p(2).toInt,p(3).toString)).toDF()
    
    data.show()
    
    data.createOrReplaceTempView("nation")


  2. Part

    case class Part(P_PARTKEY:Integer,P_NAME:String,P_MFGR:String,P_BRAND:String,P_TYPE:String,P_SIZE:Integer,P_CONTAINER:String,P_RETAILPRICE:Double,P_COMMENT:String)
    
    val data = spark.sparkContext.textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/part.tbl").map(_.split('|')).map(p => Part(p(0).toInt, p(1).toString,p(2).toString,p(3).toString,p(4).toString,p(5).toInt,p(6).toString,p(7).toDouble,p(8).toString)).toDF()
    
    data.show()
    
    data.createOrReplaceTempView("part")

  3. Supplier

    case class Supplier(S_SUPPKEY:Integer,S_NAME:String,S_ADDRESS:String,S_NATIONKEY:Integer,S_PHONE:String,S_ACCTBAL:Double,S_COMMENT:String)
    
    val data = spark.sparkContext.textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/supplier.tbl").map(_.split('|')).map(p => Supplier(p(0).toInt, p(1).toString,p(2).toString,p(3).toInt,p(4).toString,p(5).toDouble,p(6).toString)).toDF()
    
    data.show()
    
    data.createOrReplaceTempView("supplier")

  4. Partsupp

    case class Partsupp(PS_PARTKEY:Integer,PS_SUPPKEY:Integer,PS_AVAILQTY:Integer,PS_SUPPLYCOST:Double,PS_COMMENT:String)
    
    val data = spark.sparkContext.textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/partsupp.tbl").map(_.split('|')).map(p => Partsupp(p(0).toInt, p(1).toInt,p(2).toInt,p(3).toDouble,p(4).toString)).toDF()
    
    data.show()
    
    data.createOrReplaceTempView("partsupp")

  5. Customer

    case class Customer(C_CUSTKEY:Integer,C_NAME:String,C_ADDRESS:String,C_NATIONKEY:Integer,C_PHONE:String,C_ACCTBAL:Double,C_MKTSEGMENT:String,C_COMMENT:String)
    
    val data = spark.sparkContext.textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/customer.tbl").map(_.split('|')).map(p => Customer(p(0).toInt, p(1).toString,p(2).toString,p(3).toInt,p(4).toString,p(5).toDouble,p(6).toString,p(7).toString)).toDF()
    
    data.show()
    
    data.createOrReplaceTempView("customer")

  6. Orders

    case class Orders(O_ORDERKEY:Integer,O_CUSTKEY:Integer,O_ORDERSTATUS:String,O_TOTALPRICE:Double,O_ORDERDATE:String,O_ORDERPRIORITY:String,O_CLERK:String,O_SHIPPRIORITY:Integer,O_COMMENT:String)
    
    val data = spark.sparkContext.textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/orders.tbl").map(_.split('|')).map(p => Orders(p(0).toInt, p(1).toInt,p(2).toString,p(3).toDouble,p(4).toString,p(5).toString,p(6).toString,p(7).toInt,p(8).toString)).toDF()
    
    data.show()
    
    data.createOrReplaceTempView("orders")

  7. LineItem

    case class LineItem(L_ORDERKEY:Integer,L_PARTKEY:Integer,L_SUPPKEY:Integer,L_LINENUMBER:Integer,L_QUANTITY:Double,L_EXTENDEDPRICE:Double,L_DISCOUNT:Double,L_TAX:Double,L_RETURNFLAG:String,L_LINESTATUS:String,L_SHIPDATE:String,L_COMMITDATE:String,L_RECEIPTDATE:String,L_SHIPINSTRUCT:String,L_SHIPMODE:String,L_COMMENT:String)
    
    val data = spark.sparkContext.textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/lineitem.tbl").map(_.split('|')).map(p => LineItem(p(0).toInt, p(1).toInt,p(2).toInt,p(3).toInt,p(4).toDouble,p(5).toDouble,p(6).toDouble,p(7).toDouble,p(8).toString, p(9).toString,p(10).toString,p(11).toString,p(12).toString,p(13).toString,p(14).toString,p(15).toString)).toDF()
    data.show()
    
    data.createOrReplaceTempView("lineItem")

  8. Region

    case class Region(R_REGIONKEY:Integer,R_NAME:String,R_COMMENT:String)
    
    val data = spark.sparkContext.textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/region.tbl").map(_.split('|')).map(p => Region(p(0).toInt, p(1).toString,p(2).toString)).toDF()
    
    data.show()
    
    data.createOrReplaceTempView("region")

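All eight tables follow the same pattern: read the .tbl file as text, split each line on '|', map the fields into a case class, convert to a DataFrame with toDF(), and register a temporary view. One thing worth noting: toDF() on an RDD of case classes relies on the implicit encoders that spark-shell imports automatically; in a compiled application the SparkSession and the import have to be set up explicitly. A minimal sketch (the builder settings below are illustrative assumptions, not part of the shell session above):

import org.apache.spark.sql.SparkSession

// Only needed outside spark-shell; the shell creates `spark` and imports the implicits for you
val spark = SparkSession.builder().appName("tpch-load").master("local[*]").getOrCreate()
import spark.implicits._   // provides the encoders behind .toDF() on RDDs of case classes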

4. The 22 Queries of the TPC-H Test Suite

4.1 Query Statements

  1. sql_1

    val sql_1= spark.sql("select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where l_shipdate <= date '1998-12-01' - interval '84' day group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus limit 1;")
    sql_1.show()


  2. sql_2

    val sql_2=spark.sql("select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment from part, supplier, partsupp, nation, region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = 16 and p_type like '%STEEL' and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'AFRICA' and ps_supplycost = ( select min(ps_supplycost) from partsupp, supplier, nation, region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'AFRICA' ) order by s_acctbal desc, n_name, s_name, p_partkey limit 100;")
    sql_2.show()

  3. sql_3

    val sql_3=spark.sql("select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue, o_orderdate, o_shippriority from customer, orders, lineitem where c_mktsegment = 'MACHINERY' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < date '1995-03-06' and l_shipdate > date '1995-03-06' group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, o_orderdate limit 10;")
    sql_3.show()

  4. sql_4

    val sql_4=spark.sql("select o_orderpriority, count(*) as order_count from orders where o_orderdate >= date '1993-10-01' and o_orderdate < date '1993-10-01' + interval '3' month and exists ( select * from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate ) group by o_orderpriority order by o_orderpriority limit 1;")
    sql_4.show()

  5. sql_5

    val sql_5=spark.sql("select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue from customer, orders, lineitem, supplier, nation, region where c_custkey = o_custkey and l_orderkey = o_orderkey and l_suppkey = s_suppkey and c_nationkey = s_nationkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'AMERICA' and o_orderdate >= date '1996-01-01' and o_orderdate < date '1996-01-01' + interval '1' year group by n_name order by revenue desc limit 1;") 
    sql_5.show()

  6. sql_6

    val sql_6=spark.sql("select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate >= date '1996-01-01' and l_shipdate < date '1996-01-01' + interval '1' year and l_discount between 0.05 - 0.01 and 0.05 + 0.01 and l_quantity < 25 limit 1;")
    sql_6.show()

  7. sql_7

    val sql_7=spark.sql("select supp_nation, cust_nation, l_year, sum(volume) as revenue from ( select n1.n_name as supp_nation, n2.n_name as cust_nation, extract(year from l_shipdate) as l_year, l_extendedprice * (1 - l_discount) as volume from supplier, lineitem, orders, customer, nation n1, nation n2 where s_suppkey = l_suppkey and o_orderkey = l_orderkey and c_custkey = o_custkey and s_nationkey = n1.n_nationkey and c_nationkey = n2.n_nationkey and ( (n1.n_name = 'IRAN' and n2.n_name = 'ROMANIA') or (n1.n_name = 'ROMANIA' and n2.n_name = 'IRAN') ) and l_shipdate between date '1995-01-01' and date '1996-12-31' ) as shipping group by supp_nation, cust_nation, l_year order by supp_nation, cust_nation, l_year limit 1;")
    sql_7.show()

  8. sql_8

    val sql_8=spark.sql("select o_year, sum(case when nation = 'ROMANIA' then volume else 0 end) / sum(volume) as mkt_share from ( select extract(year from o_orderdate) as o_year, l_extendedprice * (1 - l_discount) as volume, n2.n_name as nation from part, supplier, lineitem, orders, customer, nation n1, nation n2, region where p_partkey = l_partkey and s_suppkey = l_suppkey and l_orderkey = o_orderkey and o_custkey = c_custkey and c_nationkey = n1.n_nationkey and n1.n_regionkey = r_regionkey and r_name = 'EUROPE' and s_nationkey = n2.n_nationkey and o_orderdate between date '1995-01-01' and date '1996-12-31' and p_type = 'STANDARD ANODIZED COPPER' ) as all_nations group by o_year order by o_year limit 1;")
    sql_8.show()

  9. sql_9

    val sql_9=spark.sql("select nation, o_year, sum(amount) as sum_profit from ( select n_name as nation, extract(year from o_orderdate) as o_year, l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount from part, supplier, lineitem, partsupp, orders, nation where s_suppkey = l_suppkey and ps_suppkey = l_suppkey and ps_partkey = l_partkey and p_partkey = l_partkey and o_orderkey = l_orderkey and s_nationkey = n_nationkey and p_name like '%aquamarine%' ) as profit group by nation, o_year order by nation, o_year desc limit 1;")
    sql_9.show()

  10. sql_10

    val sql_10=spark.sql("select c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue, c_acctbal, n_name, c_address, c_phone, c_comment from customer, orders, lineitem, nation where c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate >= date '1994-03-01' and o_orderdate < date '1994-03-01' + interval '3' month and l_returnflag = 'R' and c_nationkey = n_nationkey group by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment order by revenue desc limit 20;")
    sql_10.show()

  11. sql_11

    val sql_11=spark.sql("select ps_partkey, sum(ps_supplycost * ps_availqty) as value from partsupp, supplier, nation where ps_suppkey = s_suppkey and s_nationkey = n_nationkey and n_name = 'PERU' group by ps_partkey having sum(ps_supplycost * ps_availqty) > ( select sum(ps_supplycost * ps_availqty) * 0.0001000000 from partsupp, supplier, nation where ps_suppkey = s_suppkey and s_nationkey = n_nationkey and n_name = 'PERU' ) order by value desc limit 1;")
    sql_11.show()

  12. sql_12

    val sql_12=spark.sql("select l_shipmode, sum(case when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' then 1 else 0 end) as high_line_count, sum(case when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' then 1 else 0 end) as low_line_count from orders, lineitem where o_orderkey = l_orderkey and l_shipmode in ('RAIL', 'SHIP') and l_commitdate < l_receiptdate and l_shipdate < l_commitdate and l_receiptdate >= date '1993-01-01' and l_receiptdate < date '1993-01-01' + interval '1' year group by l_shipmode order by l_shipmode limit 1;")
    sql_12.show()

  13. sql_13

    val sql_13=spark.sql("select c_count, count(*) as custdist from ( select c_custkey, count(o_orderkey) as c_count from customer left outer join orders on c_custkey = o_custkey and o_comment not like '%pending%deposits%' group by c_custkey ) as c_orders group by c_count order by custdist desc, c_count desc limit 1;")
    sql_13.show()

  14. sql_14

    val sql_14=spark.sql("select 100.00 * sum(case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount) else 0 end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue from lineitem, part where l_partkey = p_partkey and l_shipdate >= date '1997-12-01' and l_shipdate < date '1997-12-01' + interval '1' month limit 1;")
    sql_14.show()

  15. sql_15

    A "revenue0" view needs to be registered for this query; since CREATE VIEW is not used with Spark SQL here, a temporary view is registered instead:

    val temp1=spark.sql("select l_suppkey as supplier_no, sum(l_extendedprice * (1 - l_discount)) as total_revenue from lineitem where l_shipdate >= date '1993-03-01' and l_shipdate < date '1993-03-01' + interval '3' month group by l_suppkey;")
    temp1.createOrReplaceTempView("revenue0")
    val sql_15=spark.sql("select s_suppkey, s_name, s_address, s_phone, total_revenue from supplier, revenue0 where s_suppkey = supplier_no and total_revenue = ( select max(total_revenue) from revenue0 ) order by s_suppkey;")
    sql_15.show()

  16. sql_16

    val sql_16=spark.sql("select p_brand, p_type, p_size, count(distinct ps_suppkey) as supplier_cnt from partsupp, part where p_partkey = ps_partkey and p_brand <> 'Brand#52' and p_type not like 'LARGE ANODIZED%' and p_size in (42, 38, 15, 48, 33, 3, 27, 45) and ps_suppkey not in ( select s_suppkey from supplier where s_comment like '%Customer%Complaints%' ) group by p_brand, p_type, p_size order by supplier_cnt desc, p_brand, p_type, p_size limit 1;")
    sql_16.show()

  17. sql_17

    val sql_17=spark.sql("select sum(l_extendedprice) / 7.0 as avg_yearly from lineitem, part where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'WRAP CASE' and l_quantity < ( select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey ) limit 1;")
    sql_17.show()

  18. sql_18

    val sql_18=spark.sql("select c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity) from customer, orders, lineitem where o_orderkey in ( select l_orderkey from lineitem group by l_orderkey having sum(l_quantity) > 314 ) and c_custkey = o_custkey and o_orderkey = l_orderkey group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice order by o_totalprice desc, o_orderdate limit 100;")
    sql_18.show()

  19. sql_19

    val sql_19=spark.sql("select sum(l_extendedprice* (1 - l_discount)) as revenue from lineitem, part where ( p_partkey = l_partkey and p_brand = 'Brand#54' and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= 5 and l_quantity <= 5 + 10 and p_size between 1 and 5 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#25' and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >= 19 and l_quantity <= 19 + 10 and p_size between 1 and 10 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#42' and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity >= 24 and l_quantity <= 24 + 10 and p_size between 1 and 15 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) limit 1;")
    sql_19.show()

  20. sql_20

    val sql_20=spark.sql("select s_name, s_address from supplier, nation where s_suppkey in ( select ps_suppkey from partsupp where ps_partkey in ( select p_partkey from part where p_name like 'purple%' ) and ps_availqty > ( select 0.5 * sum(l_quantity) from lineitem where l_partkey = ps_partkey and l_suppkey = ps_suppkey and l_shipdate >= date '1994-01-01' and l_shipdate < date '1994-01-01' + interval '1' year ) ) and s_nationkey = n_nationkey and n_name = 'UNITED KINGDOM' order by s_name limit 1;")
    sql_20.show()

  21. sql_21

    val sql_21=spark.sql("select s_name, count(*) as numwait from supplier, lineitem l1, orders, nation where s_suppkey = l1.l_suppkey and o_orderkey = l1.l_orderkey and o_orderstatus = 'F' and l1.l_receiptdate > l1.l_commitdate and exists ( select * from lineitem l2 where l2.l_orderkey = l1.l_orderkey and l2.l_suppkey <> l1.l_suppkey ) and not exists ( select * from lineitem l3 where l3.l_orderkey = l1.l_orderkey and l3.l_suppkey <> l1.l_suppkey and l3.l_receiptdate > l3.l_commitdate ) and s_nationkey = n_nationkey and n_name = 'FRANCE' group by s_name order by numwait desc, s_name limit 100;")
    sql_21.show()

  22. sql_22

    val sql_22=spark.sql("select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbal from ( select substring(c_phone from 1 for 2) as cntrycode, c_acctbal from customer where substring(c_phone from 1 for 2) in ('33', '25', '16', '23', '32', '13', '19') and c_acctbal > ( select avg(c_acctbal) from customer where c_acctbal > 0.00 and substring(c_phone from 1 for 2) in ('33', '25', '16', '23', '32', '13', '19') ) and not exists ( select * from orders where o_custkey = c_custkey ) ) as custsale group by cntrycode order by cntrycode limit 1;")
    sql_22.show()


4.2 Running-Time Analysis

  1. Open localhost:4040/ in Firefox to view the monitoring data collected while the 22 TPC-H queries run.

  2. Click SQL at the top of the UI to get the time taken by each SQL statement; a programmatic alternative is sketched below.
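Besides reading times off the web UI, the wall time of a single statement can also be measured directly in the shell with SparkSession.time, which runs the wrapped action and prints its elapsed time — a small sketch:

spark.time(sql_1.show())   // prints "Time taken: ... ms" for this action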

The execution times of the 22 queries were:

| Query | Execution time |
| --- | --- |
| sql_1 | 1.1min |
| sql_2 | 2.0min |
| sql_3 | 1.5min |
| sql_4 | 1.7min |
| sql_5 | 2.0min |
| sql_6 | 40s |
| sql_7 | 2.4min |
| sql_8 | 2.2min |
| sql_9 | 2.4min |
| sql_10 | 1.4min |
| sql_11 | 45s |
| sql_12 | 60s |
| sql_13 | 32s |
| sql_14 | 57s |
| sql_15 | 1.8min |
| sql_16 | 57s |
| sql_17 | 1.7min |
| sql_18 | 2.0min |
| sql_19 | 48s |
| sql_20 | 1.7min |
| sql_21 | 3.2min |
| sql_22 | 30s |

5. Optimization

5.1 Data Skew

Judging by the execution times of the individual SQL queries, the differences between statements are small — all of them finish in roughly 1–3 minutes — so no single skewed query stands out, and skew-oriented tuning would bring little benefit here; no special handling was done. A quick skew check is sketched below for completeness.
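The probe below is a hypothetical illustration, looking at how rows are distributed over the busiest join key; it is not something the measured times called for:

// Hypothetical skew probe: row counts per order key in lineitem, largest first
spark.sql("select l_orderkey, count(*) as cnt from lineitem group by l_orderkey order by cnt desc").show(10)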

5.2 Shuffle

In the shuffle write phase, at the end of one stage, each task "classifies" the data it has processed by key so that the next stage can run shuffle operators (such as reduceByKey): keys are hashed, records with the same key are written to the same disk file, and each disk file belongs to exactly one task of the downstream stage. Records are first written to an in-memory buffer and spilled to the disk file only when the buffer fills up. Without optimization, the number of disk files produced by shuffle write can be enormous.

The shuffle read phase is typically what happens at the start of the next stage. Each task of that stage pulls, over the network, all records with the keys assigned to it from the previous stage's outputs on every node, and then performs aggregation or joins on those keys. Because shuffle write created one file per downstream task, during shuffle read each task only has to fetch the one file that belongs to it from the nodes where the upstream tasks ran.

Concretely, for the shuffle optimization we use the consolidateFiles parameter of HashShuffleManager. If it is set to true, the consolidate mechanism is enabled, which merges shuffle-write output files to a large extent; when the number of shuffle read tasks is very large, this can greatly reduce disk I/O and improve performance. The code is as follows:

import org.apache.spark.SparkConf
val conf=new SparkConf()
conf.set("spark.shuffle.consolidateFiles","true")
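Note that a SparkConf created after spark-shell is already running does not affect the existing SparkContext, so to make sure the setting is actually picked up it can instead be supplied at launch time with the standard --conf flag — a sketch:

./bin/spark-shell --conf spark.shuffle.consolidateFiles=true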


Before the shuffle parameter was set, the full run of the 22 queries took about 33 minutes; with it set, the run took about 12 minutes — a clear improvement.

The per-query results show the same improvement:

| Query | Original time | Optimized time | Speedup |
| --- | --- | --- | --- |
| sql_1 | 1.1min | 26s | 2.54 |
| sql_2 | 2.0min | 55s | 2.18 |
| sql_3 | 1.5min | 36s | 2.50 |
| sql_4 | 1.7min | 26s | 3.92 |
| sql_5 | 2.0min | 48s | 2.50 |
| sql_6 | 40s | 11s | 3.64 |
| sql_7 | 2.4min | 50s | 2.88 |
| sql_8 | 2.2min | 52s | 2.54 |
| sql_9 | 2.4min | 54s | 2.67 |
| sql_10 | 1.4min | 33s | 2.55 |
| sql_11 | 45s | 19s | 2.37 |
| sql_12 | 60s | 21s | 2.86 |
| sql_13 | 32s | 12s | 2.67 |
| sql_14 | 57s | 17s | 3.35 |
| sql_15 | 1.8min | 30s | 3.60 |
| sql_16 | 57s | 23s | 2.48 |
| sql_17 | 1.7min | 37s | 2.76 |
| sql_18 | 2.0min | 37s | 3.24 |
| sql_19 | 48s | 16s | 3.00 |
| sql_20 | 1.7min | 33s | 3.09 |
| sql_21 | 3.2min | 1.2min | 2.67 |
| sql_22 | 30s | 13s | 2.31 |

The average speedup works out to 2.8327×, so the shuffle optimization is quite effective.

5.3 Persistence (Caching)

One important capability in Spark is persisting (caching) a dataset so that it can be reused across multiple operations. When an RDD is persisted, each node keeps the partitions of it that it computes in memory and reuses them in later actions on that dataset, which makes subsequent actions much faster (often by more than 10×). Caching is a key tool for iterative algorithms and for fast interactive use.

An RDD can be persisted with the persist() or cache() method. The data is computed the first time an action runs on it and is then cached in the nodes' memory. Spark's cache is fault-tolerant: if any partition of a cached RDD is lost, Spark automatically recomputes it from its original lineage and caches it again.

Code change: call .cache() on each DataFrame when the table is built, so that it is kept in memory. Taking supplier as an example:

val data = spark.sparkContext.textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/tpc-h/data/supplier.tbl").map(_.split('|')).map(p => Supplier(p(0).toInt, p(1).toString,p(2).toString,p(3).toInt,p(4).toString,p(5).toDouble,p(6).toString)).toDF().cache()
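On DataFrames, cache() is shorthand for persist() with the MEMORY_AND_DISK storage level; the level can also be chosen explicitly and the cache released once it is no longer needed. A small sketch building on the supplier DataFrame above:

import org.apache.spark.storage.StorageLevel

data.persist(StorageLevel.MEMORY_AND_DISK)   // same level that DataFrame.cache() uses by default
data.count()                                 // the first action materializes the cache
// ... run the queries against the cached table ...
data.unpersist()                             // drop the cached blocks when finished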

The resulting execution times are as follows:

| Query | Original time | Optimized time | Speedup |
| --- | --- | --- | --- |
| sql_1 | 1.1min | 1.1min | 1.00 |
| sql_2 | 2.0min | 25s | 4.80 |
| sql_3 | 1.5min | 24s | 3.75 |
| sql_4 | 1.7min | 14s | 7.29 |
| sql_5 | 2.0min | 19s | 6.31 |
| sql_6 | 40s | 5s | 8.00 |
| sql_7 | 2.4min | 16s | 9.00 |
| sql_8 | 2.2min | 13s | 10.15 |
| sql_9 | 2.4min | 37s | 3.89 |
| sql_10 | 1.4min | 26s | 3.23 |
| sql_11 | 45s | 6s | 7.50 |
| sql_12 | 60s | 13s | 4.61 |
| sql_13 | 32s | 12s | 2.67 |
| sql_14 | 57s | 6s | 9.50 |
| sql_15 | 1.8min | 33s | 3.27 |
| sql_16 | 57s | 32s | 1.78 |
| sql_17 | 1.7min | 18s | 5.67 |
| sql_18 | 2.0min | 18s | 6.67 |
| sql_19 | 48s | 9s | 5.33 |
| sql_20 | 1.7min | 16s | 6.38 |
| sql_21 | 3.2min | 24s | 8.00 |
| sql_22 | 30s | 12s | 2.50 |

The comparison shows that persistence is clearly effective: execution times drop from the original 1–3 min range to roughly 10–30 s, almost all under one minute and often under half a minute. The average per-query speedup works out to 5.5136×, so caching is by far the most effective change so far.

5.4 Running Spark on Multiple Cores

The number of cores can be specified when launching Spark:

./bin/spark-shell --master local[12]
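Once the shell is up, the master URL and the parallelism implied by local[12] can be checked quickly — a small sanity-check sketch:

println(spark.sparkContext.master)              // e.g. local[12]
println(spark.sparkContext.defaultParallelism)  // equals the core count for local[N]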


Comparing a single-core run (cores=1) with a multi-core run (cores=12), the per-query execution times are as follows:

| Query | Original time | Optimized time | Speedup |
| --- | --- | --- | --- |
| sql_1 | 1.1min | 28s | 2.36 |
| sql_2 | 2.0min | 54s | 2.22 |
| sql_3 | 1.5min | 42s | 2.14 |
| sql_4 | 1.7min | 34s | 3.00 |
| sql_5 | 2.0min | 1.0min | 2.00 |
| sql_6 | 40s | 14s | 2.86 |
| sql_7 | 2.4min | 1.1min | 2.18 |
| sql_8 | 2.2min | 53s | 2.49 |
| sql_9 | 2.4min | 1.1min | 2.18 |
| sql_10 | 1.4min | 37s | 2.27 |
| sql_11 | 45s | 20s | 2.25 |
| sql_12 | 60s | 24s | 2.50 |
| sql_13 | 32s | 13s | 2.46 |
| sql_14 | 57s | 19s | 3.00 |
| sql_15 | 1.8min | 36s | 3.00 |
| sql_16 | 57s | 26s | 2.19 |
| sql_17 | 1.7min | 40s | 2.55 |
| sql_18 | 2.0min | 47s | 2.55 |
| sql_19 | 48s | 19s | 2.52 |
| sql_20 | 1.7min | 43s | 2.37 |
| sql_21 | 3.2min | 1.7min | 1.88 |
| sql_22 | 30s | 16s | 1.88 |

The average speedup works out to 2.40, i.e. with more cores the queries run 2.40× as fast as before on average — again a fairly clear improvement.

5.5 VM Settings

Increase the number of virtual-machine processors (cores) from 1 to 2 and observe the effect.

Looking at each query individually:

| Query | Original time | Optimized time | Speedup |
| --- | --- | --- | --- |
| sql_1 | 1.1min | 27s | 2.44 |
| sql_2 | 2.0min | 1.1min | 1.82 |
| sql_3 | 1.5min | 1.1min | 1.36 |
| sql_4 | 1.7min | 26s | 3.92 |
| sql_5 | 2.0min | 55s | 2.18 |
| sql_6 | 40s | 13s | 3.08 |
| sql_7 | 2.4min | 56s | 2.57 |
| sql_8 | 2.2min | 48s | 2.75 |
| sql_9 | 2.4min | 56s | 2.57 |
| sql_10 | 1.4min | 32s | 2.63 |
| sql_11 | 45s | 20s | 2.25 |
| sql_12 | 60s | 21s | 2.86 |
| sql_13 | 32s | 11s | 2.91 |
| sql_14 | 57s | 19s | 3.00 |
| sql_15 | 1.8min | 38s | 2.84 |
| sql_16 | 57s | 23s | 2.48 |
| sql_17 | 1.7min | 35s | 2.91 |
| sql_18 | 2.0min | 42s | 2.86 |
| sql_19 | 48s | 18s | 2.67 |
| sql_20 | 1.7min | 39s | 2.61 |
| sql_21 | 3.2min | 1.2min | 2.67 |
| sql_22 | 30s | 13s | 2.31 |

After increasing the VM core count, the average speedup is about 2.6223×, which is also fairly significant.

6. Results and Analysis

Across the tests and optimizations of the 22 queries, the techniques rank as follows by effectiveness: persistence (5.5136×) > shuffle optimization (2.8327×) > VM multi-core (2.6223×) > Spark multi-core (2.40×). Persistence is clearly the most effective of the four.

We also considered data partitioning and data skew, but the dataset is laid out fairly evenly and the execution times of the queries do not differ much from one another; moreover, TPC-H dates are at day granularity and the data volume is large, which makes time-based partitioning impractical here.
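For reference, column-based repartitioning in Spark would look like the sketch below; this is a hypothetical illustration that was not applied in the experiment:

import org.apache.spark.sql.functions.col

// Hypothetical: repartition the lineitem data by ship date and expose it as a new view
val lineitemByDate = spark.table("lineitem").repartition(col("L_SHIPDATE"))
lineitemByDate.createOrReplaceTempView("lineitem_by_date")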

7. Reflections

Through this assignment we got to know the TPC-H benchmark. By running its 22 queries we worked through Spark table creation and querying in depth, and on top of those queries we tried a variety of optimizations. This series of experiments gave us a much better understanding of data loading, storage, querying, optimization techniques, and the principles behind them in Spark.
