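Scenario:
Business workloads frequently involve join or group by operations. These shuffle the data, scattering rows that share a key, so Parquet cannot reach its best compression ratio (its column encodings work best when similar values sit next to each other). Using Cluster By to gather and sort rows with the same key lets Parquet reach its best compression ratio.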
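Background: you should be familiar with the following concepts, introduced briefly here
Distribute By: routes rows to reducers by key in the reduce stage, so rows with the same key go to the same reducer
Sort By: sorts rows within each reducer (a per-reducer sort, not a global one)
Cluster By = Distribute By + Sort By (on the same columns)
Parquet: columnar storage format + per-column compression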
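To make the equivalence above concrete, here is a minimal sketch. It assumes the tmp.source table and its id, feature, value, and data_date columns from the example later in this post; choosing feature as the clustering key is only an illustration.

```sql
-- Sketch only: tmp.source and its columns are borrowed from the example in this post;
-- clustering on feature is an assumption made for illustration.

-- Form 1: CLUSTER BY sends rows with the same feature to the same reducer
-- and sorts each reducer's rows by feature.
SELECT id, feature, value
FROM tmp.source
WHERE data_date = 20200706
CLUSTER BY feature;

-- Form 2: the same behavior written out explicitly.
SELECT id, feature, value
FROM tmp.source
WHERE data_date = 20200706
DISTRIBUTE BY feature
SORT BY feature;
```

Cluster By only sorts in ascending order on the distribution columns. When the sort needs a different order or different columns than the distribution key, write Distribute By and Sort By separately.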
Optimization example:
CREATE TABLE IF NOT EXISTS tmp.test(
id string,
feature string,
value string
)
PARTITIONED BY (
data_date bigint COMMENT 'date partition'
);
INSERT OVERWRITE TABLE tmp.test partition(data_date=001)
SELECT id, alias_name, value
FROM (
SELECT alias_name, feature
FROM tmp.mapping
WHERE data_date = 20200618
) a
JOIN (
SELECT id, feature, value
FROM tmp.source
WHERE data_date = 20200706
) b
ON a.feature = b.feature;
INSERT OVERWRITE TABLE tmp.test partition(da