I used to know that COUNT DISTINCT performs poorly, but not why: the root cause is data expansion.
Summary of the mechanism: when a query with multiple COUNT DISTINCT aggregates runs on Spark, the optimizer rewrites them into plain COUNTs by way of Expand, an exploded shuffle, and partial aggregations. The Expand step emits one copy of every input row per DISTINCT aggregate, which is where the data expansion comes from.
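To make the rewrite concrete, below is a hand-written sketch of the roughly equivalent query the optimizer produces, using a hypothetical table `t(user_id, order_id)`. Running `EXPLAIN` on the original query should show an `Expand` node in the physical plan.

```sql
-- Original query: two DISTINCT aggregates over the hypothetical table t.
-- SELECT COUNT(DISTINCT user_id), COUNT(DISTINCT order_id) FROM t;

-- Roughly equivalent rewrite:
SELECT
  COUNT(user_id)  AS distinct_users,   -- only gid = 1 rows carry user_id
  COUNT(order_id) AS distinct_orders   -- only gid = 2 rows carry order_id
FROM (
  -- Shuffle on (user_id, order_id, gid) and deduplicate; partial
  -- aggregation runs map-side before this shuffle.
  SELECT user_id, order_id, gid
  FROM (
    -- "Expand": every input row is emitted once per DISTINCT aggregate,
    -- tagged with a group id. With two DISTINCTs the data doubles here,
    -- which is the data expansion described above.
    SELECT user_id, NULL AS order_id, 1 AS gid FROM t
    UNION ALL
    SELECT NULL AS user_id, order_id, 2 AS gid FROM t
  ) expanded
  GROUP BY user_id, order_id, gid
) deduped;
```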
Must-read deep dives:
- Distributed COUNT DISTINCT – How it Works in Spark, Multiple COUNT DISTINCT, Transform to COUNT with Expand, Exploded Shuffle, Partial Aggregations (link)
- Spark SQL distinct分析优化总结 (a Chinese write-up on analyzing and optimizing DISTINCT in Spark SQL) (link)
Hands-on approach:
Raise shuffle parallelism for the heavy aggregation, then merge the resulting small files when writing the output.
## Merging small files and raising parallelism

```sql
-- Merge small output files (sizes in bytes: 20 MB / 64 MB / 128 MB).
-- Note: spark.sql.mergeSmallFileSize and spark.hadoopRDD.targetBytesInPartition
-- are not vanilla Apache Spark settings; they come from vendor forks.
SET spark.sql.mergeSmallFileSize=20971520;
SET spark.hadoopRDD.targetBytesInPartition=67108864;
SET spark.sql.adaptive.shuffle.targetPostShuffleInputSize=134217728;

-- Disable adaptive execution so partition coalescing does not override
-- the manual shuffle-partition count below.
SET spark.sql.adaptive.enabled=false;

-- Raise shuffle parallelism from the default 200 to 2000.
SET spark.sql.shuffle.partitions=2000;
```
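For context, here is a hypothetical end-to-end session (the table and column names `dwd_events` and `dws_daily_stats` are placeholders): apply the settings, run the multi-DISTINCT aggregation at the higher parallelism, and let the small-file settings compact the 2000 output files at write time.

```sql
SET spark.sql.adaptive.enabled=false;
SET spark.sql.shuffle.partitions=2000;

-- Placeholder schema: dwd_events(dt, user_id, order_id) -> dws_daily_stats.
INSERT OVERWRITE TABLE dws_daily_stats
SELECT
  dt,
  COUNT(DISTINCT user_id)  AS uv,
  COUNT(DISTINCT order_id) AS order_cnt
FROM dwd_events
GROUP BY dt;
```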