spark优化之分区插入

spark.sql.sources.partitionOverwriteModeSTATIC

When INSERT OVERWRITE a partitioned data source table, we currently support 2 modes: static and dynamic. In static mode, Spark deletes all the partitions that match the partition specification(e.g. PARTITION(a=1,b)) in the INSERT statement, before overwriting. In dynamic mode, Spark doesn't delete partitions ahead, and only overwrite those partitions that have data written into it at runtime. By default we use static mode to keep the same behavior of Spark prior to 2.3. Note that this config doesn't affect Hive serde tables, as they are always overwritten with dynamic mode. This can also be set as an output option for a data source using key partitionOverwriteMode (which takes precedence over this setting), e.g. dataframe.write.option("partitionOverwriteMode", "dynamic").save(path).

2.3.0
// 覆盖指定分区
table("tv_group").write.option("partitionOverwriteMode", "dynamic").partitionBy("store_id", "group_id").
      mode(SaveMode.Overwrite).save("xxx")

// 删除所有分区再插入
table("tv_group").write.option("partitionOverwriteMode", "STATIC").partitionBy("store_id", "group_id").
      mode(SaveMode.Overwrite).save("xx")

Save Spark dataframe as dynamic partitioned table in Hive

I have a sample application working to read from csv files into a dataframe. The dataframe can be stored to a Hive table in parquet format using the method df.saveAsTable(tablename,mode).

The above code works fine, but I have so much data for each day that i want to dynamic partition the hive table based on the creationdate(column in the table).

is there any way to dynamic partition the dataframe and store it to hive warehouse. Want to refrain from Hard-coding the insert statement using hivesqlcontext.sql(insert into table partittioin by(date)....).

Question can be considered as an extension to :How to load DataFrame directly to Hive in Spark

any help is much appreciated.

------------------------

I believe it works something like this:

df is a dataframe with year, month and other columns

df.write.partitionBy('year', 'month').saveAsTable(...)

or

df.write.partitionBy('year', 'month').insertInto(...)

Tried this Partitionby method. It only works on RDD level, once dataframe is created most of the methods are DBMS styled e.g. groupby, orderby but they don't serve the purpose of writing in different partitions folders on Hive. 

Ok, so was able to work it out with 1.4 version. df.write().mode(SaveMode.Append).partitionBy("date").saveAsT‌​able("Tablename"); . This however changes my date field to integer value and remove the actual date. e.g. there are 9 unique dates in the column but they are now stored as 1,2,3.... and folder name is date=1,2,3,... instead of date=20141121. Let me know if there is a way to do this

------------------

I was able to write to partitioned hive table using df.write().mode(SaveMode.Append).partitionBy("colname").saveAsTable("Table")

I had to enable the following properties to make it work.

hiveContext.setConf("hive.exec.dynamic.partition", "true") hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

Where should i set the above 2 parameters ? I tried logging in hive shell and run above commands, it failed. i am sure i am doing it wrong. Could you please tell where can i set these properties ?

@VrushankDoshi You would set it in the spark program, right after you create your hiveContext. val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) hiveContext.setConf("hive.exec.dynamic.partition","true") hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

--------------

I was able to write to partitioned hive table using df.write().mode(SaveMode.Append).partitionBy("colname").saveAsTable("Table")

I had to enable the following properties to make it work.

hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

------------

I believe it works something like this:

df is a dataframe with year, month and other columns

df.write.partitionBy('year', 'month').saveAsTable(...)

or

df.write.partitionBy('year', 'month').insertInto(...)

----------------------

$SPARK_HOME/bin/spark-submit --files $SPARK_HOME/conf/hive-site.xml --num-executors 2 --executor-memory 1g --executor-cores 1 --class com.bl.bigdata.cdp.execservice.service.batch.TestApp --name TestApp_wsw /var/lib/hive/ww/TestApp.jar

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

大怀特

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值