Impala,Hive,SparkSQL数据清洗后对后续查询的影响比较

Impala,Hive,SparkSQL数据清洗后对后续查询的影响比较

SparkSQL和Impala执行的时间会比较快,Hive明显慢很多,但是这里比较不是执行单次查询效率,而是三种方式清洗后数据产生结果小文件对后续使用的影响。
首先准备好2000万表记录,作为外部表建好。
然后把出生日期星座划分,结果生成表格存放。三种方式存放3个不同表格。
###############################  hive 生成表 external_2000w_hive
create table external_2000w_hive as
select Birthday,
CAST(substring(Birthday,5) AS INT) as born_day,
CASE
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 120 and CAST(substring(Birthday,5) AS INT)  <= 219 THEN "水瓶座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 220 and CAST(substring(Birthday,5) AS INT)  <= 320 THEN "双鱼座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 321 and CAST(substring(Birthday,5) AS INT)  <= 420 THEN "白羊座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 421 and CAST(substring(Birthday,5) AS INT)  <= 521 THEN "金牛座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 522 and CAST(substring(Birthday,5) AS INT) <= 621 THEN "双子座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 622 and CAST(substring(Birthday,5) AS INT) <= 722 THEN "巨蟹座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 723 and CAST(substring(Birthday,5) AS INT) <= 823 THEN "狮子座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 824 and CAST(substring(Birthday,5) AS INT) <= 923 THEN "处女座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 924 and CAST(substring(Birthday,5) AS INT) <= 1023 THEN "天秤座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 1024 and CAST(substring(Birthday,5) AS INT)  <= 1122 THEN "天蝎座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 1123 and CAST(substring(Birthday,5) AS INT) <= 1222 THEN "射手座"
WHEN (length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT)  >= 120 and CAST(substring(Birthday,5) AS INT)  <= 1231) 
or (length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT)  >= 101 and CAST(substring(Birthday,5) AS INT) <= 119) THEN "摩蝎座"
ELSE "未知"
END AS XingZuo
from external_2000w
where name <> 'Name';
###############################  spark-sql 生成表 external_2000w_spark
create table external_2000w_spark as
select Birthday,
CAST(substring(Birthday,5) AS INT) as born_day,
CASE
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 120 and CAST(substring(Birthday,5) AS INT)  <= 219 THEN "水瓶座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 220 and CAST(substring(Birthday,5) AS INT)  <= 320 THEN "双鱼座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 321 and CAST(substring(Birthday,5) AS INT)  <= 420 THEN "白羊座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 421 and CAST(substring(Birthday,5) AS INT)  <= 521 THEN "金牛座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 522 and CAST(substring(Birthday,5) AS INT) <= 621 THEN "双子座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 622 and CAST(substring(Birthday,5) AS INT) <= 722 THEN "巨蟹座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 723 and CAST(substring(Birthday,5) AS INT) <= 823 THEN "狮子座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 824 and CAST(substring(Birthday,5) AS INT) <= 923 THEN "处女座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 924 and CAST(substring(Birthday,5) AS INT) <= 1023 THEN "天秤座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 1024 and CAST(substring(Birthday,5) AS INT)  <= 1122 THEN "天蝎座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 1123 and CAST(substring(Birthday,5) AS INT) <= 1222 THEN "射手座"
WHEN (length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT)  >= 120 and CAST(substring(Birthday,5) AS INT)  <= 1231) 
or (length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT)  >= 101 and CAST(substring(Birthday,5) AS INT) <= 119) THEN "摩蝎座"
ELSE "未知"
END AS XingZuo
from external_2000w
where name <> 'Name';
###############################  impala 生成表 external_2000w_impala
create table external_2000w_impala as
select Birthday,
CAST(substring(Birthday,5) AS INT) as born_day,
CASE
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 120 and CAST(substring(Birthday,5) AS INT)  <= 219 THEN "水瓶座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 220 and CAST(substring(Birthday,5) AS INT)  <= 320 THEN "双鱼座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 321 and CAST(substring(Birthday,5) AS INT)  <= 420 THEN "白羊座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 421 and CAST(substring(Birthday,5) AS INT)  <= 521 THEN "金牛座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 522 and CAST(substring(Birthday,5) AS INT) <= 621 THEN "双子座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 622 and CAST(substring(Birthday,5) AS INT) <= 722 THEN "巨蟹座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 723 and CAST(substring(Birthday,5) AS INT) <= 823 THEN "狮子座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 824 and CAST(substring(Birthday,5) AS INT) <= 923 THEN "处女座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 924 and CAST(substring(Birthday,5) AS INT) <= 1023 THEN "天秤座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 1024 and CAST(substring(Birthday,5) AS INT)  <= 1122 THEN "天蝎座"
WHEN length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT) >= 1123 and CAST(substring(Birthday,5) AS INT) <= 1222 THEN "射手座"
WHEN (length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT)  >= 120 and CAST(substring(Birthday,5) AS INT)  <= 1231) 
or (length(Birthday) = 8 and CAST(substring(Birthday,5) AS INT)  >= 101 and CAST(substring(Birthday,5) AS INT) <= 119) THEN "摩蝎座"
ELSE "未知"
END AS XingZuo
from external_2000w
where name <> 'Name';
########################################################################
3个表格生成后,查询,检查记录。如下:

[hadoop@snn ~]$ spark-sql
15/12/20 15:11:06 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
SET hive.support.sql11.reserved.keywords=false
15/12/20 15:12:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
SET spark.sql.hive.version=1.2.1
SET spark.sql.hive.version=1.2.1
spark-sql> show tables;
external_2000w  false
external_2000w_hive     false
external_2000w_impala   false
external_2000w_spark    false
external_bz2_tbl_6005   false
external_tar_gz_tbl_6005        false
external_tbl_6005       false
hive_user_info  false
pcp     false
pcp2    false
pcp2_16787      false
sqoop_test      false
tbl_6005        false
test_table_16538        false
test_table_16539        false
Time taken: 3.28 seconds, Fetched 15 row(s)
spark-sql>
spark-sql> select count(1) as cnt from external_2000w_hive;
20051429
Time taken: 17.677 seconds, Fetched 1 row(s)
spark-sql> select count(1) as cnt from external_2000w_spark;
20051429
Time taken: 10.313 seconds, Fetched 1 row(s)
spark-sql> select count(1) as cnt from external_2000w_impala;
20051440
Time taken: 8.483 seconds, Fetched 1 row(s)
spark-sql>

发现Impala执行的结果有一点问题,原始外部表是11个文件,每个文件有个列头,所以加“where name <> 'Name'”,但是看到impala并没有过滤掉这11条记录。
暂时不管这个。待查原因
######################
生成3个表格在后台的文件存储情况
--hive
hive 任务分11个MAP执行,结果也是11个文件。(默认:没有设置输出结果合并,hive含有结果文件合并的开关。)
插图1:

插图2:


--impala
impala 任务分31个部分执行,结果4个文件。
插图1:

插图2:


--spark
spark 任务分31个tasks执行,结果也是31个文件。
插图1:

插图2:


########################
spark中对任务的监控:
插图:

不难看出,对于3张内容一样,但是存储文件不同,对于后续的查询影响还是比较明显的。

执行的文件数量多,开销多,shuffle也多,查询时间的task数量和实验成倍数增长。


没有更多推荐了,返回首页