Spark: Reading and Writing Hive Bucketed Tables

 

I have a number of tables (each with roughly 100 million rows) that are stored as external Hive tables in Parquet format. A Spark job needs to join several of them on a single column, with almost no filtering. The join column has about 2-3x fewer unique values than the number of rows.

I can see shuffles happening on the join key, and I have been trying to use bucketing/partitioning to improve join performance. My thinking is that if Spark can be made aware that each of these tables is bucketed on the same column, it can load the DataFrames and join them without shuffling. I have tried Hive bucketing, but the shuffles don't go away. (I found out later that, per Spark's documentation, Hive bucketing is not supported as of Spark 2.3.0.) Can I use Spark's bucketing feature to do this? If so, would I have to disable Hive support and just read the files directly? Or could I rewrite the tables once using Spark's bucketing scheme and still be able to read them as Hive tables?
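For context, a minimal sketch (not from the original post) of the kind of join being described, assuming two hypothetical Hive tables customers and orders registered in the metastore and a shared customer_key join column; explain() on the result shows whether an Exchange (shuffle) on the join key appears in the physical plan:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketed-join-check")
  .enableHiveSupport()  // read the external Hive tables through the metastore
  .getOrCreate()

// Hypothetical table and column names, for illustration only
val customers = spark.table("customers")
val orders    = spark.table("orders")

// Join on the single shared key, with no filtering
val joined = customers.join(orders, Seq("customer_key"))

// If Spark does not recognize the tables as bucketed on customer_key, the plan
// contains Exchange hashpartitioning(customer_key, ...) on both join inputs.
joined.explain()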

EDIT: For writing out the Hive bucketed tables I was using something like:

customerDF
  .write
  .option("path", "/some/path")
  .mode("overwrite")
  .format("parquet")
  .bucketBy(200, "customer_key")
  .sortBy("customer_key")
  .saveAsTable("table_name")

// bucketBy() is only supported together with saveAsTable(); writing straight
// to a path with .parquet(path) throws an AnalysisException, so this table is
// saved through the metastore as well ("gift_table" is an illustrative name).
import org.apache.spark.sql.SaveMode

giftDF.write
  .partitionBy("plat", "date")
  .mode(SaveMode.Append)
  .bucketBy(20, "room_id")
  .format("parquet")
  .option("path", path)
  .saveAsTable("gift_table")
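As a rough follow-up sketch (an assumption, not part of the original post): once both join inputs have been written with bucketBy on the same column and the same bucket count via saveAsTable, reading them back through the metastore should let Spark plan the join without a shuffle, as long as bucketed reads are enabled. Here table_name is the table written above and orders_bucketed is a hypothetical second table bucketed the same way on customer_key:

// Bucketed reads are enabled by default; set explicitly here for clarity.
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")

val left  = spark.table("table_name")       // written with bucketBy(200, "customer_key") above
val right = spark.table("orders_bucketed")  // hypothetical table, same bucketing scheme

val joined = left.join(right, Seq("customer_key"))

// With matching bucket counts on both sides, the physical plan should show a
// SortMergeJoin with no Exchange on customer_key.
joined.explain()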

 
