Spark 3.0 SQL: Source Code Analysis and Parameter Tuning for Reading Hive ORC Tables with HiveTableScanExec
1 Environment Setup
1.1 Example Code
import org.apache.spark.sql.SparkSession

object SparkSqlHive {
  def main(args: Array[String]): Unit = {
    val ss = SparkSession.builder().master("local[2]").appName("the test of SparkSession")
      // Defaults to true, which reads ORC through FileSourceScanExec; set to false to force HiveTableScanExec
      .config("spark.sql.hive.convertMetastoreOrc", "false")
      // Cap each input split at 64 MB
      .config("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize", "67108864")
      .enableHiveSupport()
      .getOrCreate()
    // Drop the temp table if it already exists
    ss.sql("DROP TABLE IF EXISTS temp.temp_ods_start_log")
    // Read the ORC table ods_start_log and write the result into a temp table
    val df = ss.sql("CREATE TABLE IF NOT EXISTS temp.temp_ods_start_log as select substr(str,1,10) as str10 from biods.ods_start_log where dt='20210721'")
    // Action; note that the CTAS above already executes eagerly inside sql()
    df.count()
    // Sleep for a while so the job can be inspected in the Spark UI
    Thread.sleep(1000000)
    ss.stop()
  }
}
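To double-check which physical operator the scan actually uses, you can print the physical plan of a plain query on the table (a minimal sketch reusing the session above): with convertMetastoreOrc=false the plan should contain a HiveTableScan node, while the default setting would show a FileScan orc (FileSourceScanExec) node instead.
// Print the physical plan; look for HiveTableScan vs. FileScan orc
ss.sql("select str from biods.ods_start_log where dt='20210721'").explain()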
1.2 Preparing the Hive ORC Table
Since LOAD DATA cannot load plain-text files into an ORC table, the data is first loaded into a textfile table and then inserted into the ORC table.
-- Create the database
create database biods;
-- Create the external ORC table
create external table biods.ods_start_log
(
`str` string
)
comment 'user app-start log data'
partitioned by (`dt` string)
stored as orc
location '/bi/ods/ods_start_log';
-- Create the external textfile table
create external table biods.txt_ods_start_log
(
`str` string
)
comment 'user app-start log data'
partitioned by (`dt` string)
stored as textfile
location '/bi/ods/txt_ods_start_log';
-- Add the partitions
alter table biods.ods_start_log add partition(dt='20210721');
alter table biods.txt_ods_start_log add partition(dt='20210721');
-- Load the raw log data into the textfile table
load data local inpath '/home/cwc/data/start0721.log' overwrite into table biods.txt_ods_start_log partition(dt='20210721');
-- Insert into the ORC table
insert overwrite table biods.ods_start_log
partition(dt='20210721')
select str
from biods.txt_ods_start_log
where dt='20210721';
-- Merge small files in the partition with ORC concatenate (if needed)
alter table biods.ods_start_log partition(dt='20210721') concatenate;
The resulting partition files are as follows:
[root@hadoop3 ~]# hdfs dfs -ls -R -h hdfs://hadoop1:9000/bi/ods/ods_start_log/dt=20210721
drwxr-xr-x - root supergroup 0 2022-10-22 12:29 hdfs://hadoop1:9000/bi/ods/ods_start_log/dt=20210721/.hive-staging_hive_2022-10-22_12-29-16_934_837446341335257460-1
drwxr-xr-x - root supergroup 0 2022-10-22 12:29 hdfs://hadoop1:9000/bi/ods/ods_start_log/dt=20210721/.hive-staging_hive_2022-10-22_12-29-16_934_837446341335257460-1/-ext-10001
drwxr-xr-x - root supergroup 0 2022-10-22 12:29 hdfs://hadoop1:9000/bi/ods/ods_start_log/dt=20210721/.hive-staging_hive_2022-10-22_12-29-16_934_837446341335257460-1/_tmp.-ext-10002
-rwxr-xr-x 3 root supergroup 246.0 M 2022-10-25 12:18 hdfs://hadoop1:9000/bi/ods/ods_start_log/dt=20210721/000000_1
-rwxr-xr-x 3 root supergroup 94.1 M 2022-10-25 12:18 hdfs://hadoop1:9000/bi/ods/ods_start_log/dt=20210721/000001_0
The HDFS block size of the files is 256 MB:
[root@hadoop3 ~]# hadoop fs -stat "%o %r" hdfs://hadoop1:9000/bi/ods/ods_start_log/dt=20210721/000001_0
268435456 3
[root@hadoop3 ~]# hadoop fs -stat "%o %r" hdfs://hadoop1:9000/bi/ods/ods_start_log/dt=20210721/000000_1
268435456 3
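With the 64 MB split.maxsize set in the example code, a rough back-of-the-envelope estimate of the input split count looks like this (a sketch only; ORC actually snaps splits to stripe boundaries, so real counts can differ slightly):
// Hypothetical arithmetic: ceil(fileSize / maxSplitSize) per file
val maxSplitSize = 67108864L       // 64 MB, from split.maxsize above
val fileMB = Seq(246.0, 94.1)      // the two partition files listed above
val estSplits = fileMB.map(m => math.ceil(m * 1024 * 1024 / maxSplitSize).toInt)
// estSplits == List(4, 2): about 6 splits, versus 2 when the 256 MB block size is the effective split size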
The stripe counts of the ORC files are as follows:
The 94 MB file has 2 stripes, and the 246 MB file has 11 stripes.
[root@hadoop3 ~]# hive --orcfiledump hdfs://hadoop1:9000/bi/ods/ods_start_log/dt=20210721/000001_0 | less
Processing data file hdfs://hadoop1:9000/bi/ods/ods_start_log/dt=20210721/000001_0 [length: 98673168]
Structure for hdfs://hadoop1:9000/bi/ods/ods_start_log/dt=20210721/000001_0
File Version: 0.12 with ORC_135
Rows: 3043150
Compression: ZLIB
Compression size: 262144
Type: struct<str:string>
Stripe Statistics:
Stripe 1:
Column 0: count: 2020000 hasNull: false
Column 1: count: 2020000 hasNull: false min: 2021-09-16 16:26:46.194 [main] INFO com.lagou.ecommerce.AppStart -
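The effect of the stripe layout and split.maxsize on parallelism can be checked directly: the partition count of the scan's underlying RDD reflects the number of input splits generated by OrcInputFormat, and it also shows up as the task count of the scan stage in the Spark UI. A quick check, reusing the session from 1.1:
// Partition count of the HadoopRDD behind the Hive table scan
val parts = ss.sql("select str from biods.ods_start_log where dt='20210721'").rdd.getNumPartitions
println(s"input partitions: $parts")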