Handling small files in Spark SQL (spark-sql repartition)

Article link: https://blog.csdn.net/lhxsir/article/details/99588064

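The production cluster has only 7 DataNodes, and each DataNode has a threshold of 500,000 blocks.
With 3 replicas, that means the whole cluster can hold about 7 × 500,000 / 3 ≈ 1.2 million blocks!
One table is partitioned by year/month/day. Counted as 10 years × 12 months × 365 days, it comes to about 44,000 blocks, so the cluster can hold only about 25 tables of that size.
Under normal conditions the distribution looks like this:
10,000 blocks × 20 tables,
1,000 blocks × 200 tables,
100 blocks × 2,000 tables,
10 blocks × 20,000 tables.
In real production, jobs generate a lot of small files that eat up cluster resources. This is a real headache and has to be handled properly. The methods are as follows: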
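To make the capacity estimate above easy to redo with other numbers, here is a minimal sketch in plain Scala. The object and function names are made up for illustration, and the figures are the ones quoted in this post. Exact integer arithmetic gives about 1.17 million blocks and 26 tables, which matches the rounded 1.2 million and "about 25" above.

```scala
object ClusterCapacity {
  // Distinct blocks the cluster can hold: every block is stored `replication` times.
  def usableBlocks(dataNodes: Int, blocksPerNode: Long, replication: Int): Long =
    dataNodes * blocksPerNode / replication

  // How many tables with a given block count fit into that budget.
  def tablesThatFit(budget: Long, blocksPerTable: Long): Long =
    budget / blocksPerTable

  def main(args: Array[String]): Unit = {
    val budget = usableBlocks(dataNodes = 7, blocksPerNode = 500000L, replication = 3)
    val bigTable = 10L * 12 * 365 // this post's count for the year/month/day table
    println(s"usable blocks: $budget")                                    // 1166666
    println(s"blocks in the big table: $bigTable")                        // 43800
    println(s"tables of that size that fit: ${tablesThatFit(budget, bigTable)}") // 26
  }
}
```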
Method 1: use repartition

// Repartition to 4 partitions before writing, so the insert produces fewer, larger files
df_1.repartition(4).createOrReplaceTempView("tmp_test")
hiveContext.sql("insert overwrite table asmp.tt_test select * from tmp_test")

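Based on our actual situation:
100 million rows come to roughly 40 GB (one year of data). With the cluster's current performance:
repartition(4) takes about 25 min
repartition(10) takes about 10 min
With repartition(4), a workflow that builds 3 wide tables spends an extra 25 × 3 = 75 min.
With repartition(10), the same workflow spends an extra 10 × 3 = 30 min.
In other words, reducing small files through repartition is not ideal. In practice we still need a separate program that deals with the small files, run as a scheduled job once a month!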
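As a rough idea of what the planning step of such a separate program could look like, here is a minimal sketch in plain Scala. It only decides whether a partition is worth rewriting, given the sizes of its files. Everything in it is an assumption for illustration: the names are hypothetical, the block size is taken to be 128 MB, and listing the files (for example through the Hadoop FileSystem API) is left out.

```scala
object CompactionPlan {
  val BlockSize: Long = 128L * 1024 * 1024 // assumed HDFS block size

  // Blocks a file occupies: any non-empty file takes at least one block.
  def blocksOf(fileBytes: Long): Long =
    if (fileBytes <= 0) 0L else (fileBytes + BlockSize - 1) / BlockSize

  // Fewest files needed if every file is filled up to one block.
  def targetFileCount(fileSizes: Seq[Long]): Int =
    math.max(1L, (fileSizes.sum + BlockSize - 1) / BlockSize).toInt

  // A partition is worth rewriting when it has more files than it needs.
  def needsCompaction(fileSizes: Seq[Long]): Boolean =
    fileSizes.nonEmpty && fileSizes.size > targetFileCount(fileSizes)

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    val smallFiles = Seq.fill(200)(2 * mb) // 200 files of 2 MB, 400 MB in total
    println(s"blocks used now: ${smallFiles.map(blocksOf).sum}")        // 200
    println(s"files after compaction: ${targetFileCount(smallFiles)}") // 4
    println(s"needs compaction: ${needsCompaction(smallFiles)}")        // true
  }
}
```

For the partitions flagged this way, the rewrite itself could be the repartition plus insert overwrite from Method 1, with the target file count as the repartition argument.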

Method 2: use configuration parameters

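import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// conf is assumed to be an existing SparkConf
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
// Enable Adaptive Execution, which turns on automatic setting of the number of shuffle reducers
hiveContext.setConf("spark.sql.adaptive.enabled","true")
// Target amount of data each reducer reads; default is 64 MB, usually changed to the cluster block size
hiveContext.setConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize","128000000")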
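To get a feel for what the target size does, here is a simplified model in plain Scala. It is not Spark's code. It assumes that neighbouring shuffle partitions are packed greedily into one reducer until adding the next one would exceed the target, and the partition sizes are invented for the example.

```scala
object ShuffleCoalesce {
  // Greedily merges neighbouring shuffle partitions until adding the next one
  // would push the group over `targetBytes`. Returns the size of each group.
  def coalesce(partitionBytes: Seq[Long], targetBytes: Long): Vector[Long] =
    partitionBytes.foldLeft(Vector.empty[Long]) { (groups, size) =>
      if (groups.nonEmpty && groups.last + size <= targetBytes)
        groups.init :+ (groups.last + size) // still fits: grow the current group
      else
        groups :+ size                      // start a new group
    }

  def main(args: Array[String]): Unit = {
    val mb = 1000L * 1000
    val sizes = Seq(20, 30, 60, 50, 10, 120, 5).map(_ * mb) // made-up partition sizes
    val merged = coalesce(sizes, targetBytes = 128000000L)
    println(merged.map(_ / mb).mkString(", ")) // 110, 60, 125
  }
}
```

In this model 7 small shuffle partitions end up as 3 reducers, so the job writes 3 files instead of 7. A partition that is already larger than the target simply stays on its own.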

Other settings for reference:

// Allow Hive dynamic partitioning
hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
// Disable Spark SQL's built-in Parquet SerDe
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet","false")

// Let Adaptive Execution choose the join strategy at runtime
hiveContext.setConf("spark.sql.adaptive.join.enabled","true")
// Default is 10485760 (10 MB); set to -1 to disable BroadcastHashJoin
hiveContext.setConf("spark.sql.adaptiveBroadcastJoinThreshold","64000000")
hiveContext.setConf("spark.sql.adaptive.allowAdditionalShuffle","true")
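The effect of the broadcast threshold can be pictured with a small decision function. This is only a sketch in plain Scala with hypothetical names. It assumes the choice depends on nothing but the size of the smaller join side, whereas the real planner also looks at the join type, hints and other things.

```scala
object JoinChoice {
  sealed trait Strategy
  case object BroadcastHashJoin extends Strategy
  case object ShuffleJoin extends Strategy

  // Broadcast only if the threshold is non-negative and the smaller side fits under it.
  def choose(smallerSideBytes: Long, thresholdBytes: Long): Strategy =
    if (thresholdBytes >= 0 && smallerSideBytes <= thresholdBytes) BroadcastHashJoin
    else ShuffleJoin

  def main(args: Array[String]): Unit = {
    val side = 30000000L // a 30 MB table, invented for the example
    println(choose(side, 10485760L)) // ShuffleJoin: over the 10 MB default
    println(choose(side, 64000000L)) // BroadcastHashJoin: under the 64 MB setting above
    println(choose(side, -1L))       // ShuffleJoin: -1 turns broadcasting off
  }
}
```

A broadcast join avoids the shuffle for that join, which is why raising the threshold from 10 MB to 64 MB can matter for wide-table jobs like the ones above.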