Intro
There are several ways to write a processed DataFrame directly into Hive; each is illustrated with an example below.
Regular tables
spark.sql
A very common approach is to register the DataFrame as a temporary view and write it back with a SQL insert statement.
// 1. Register the DataFrame as a temporary view and write it back via spark.sql
df11.createOrReplaceTempView("temp")
spark.sql("insert into tmp.test2hive select * from temp")
// Drop the temporary view
spark.catalog.dropTempView("temp")
println("--- spark.sql:insert table success!")
spark.sql("select * from tmp.test2hive").show(numRows = 20, truncate = true)
df.write.insertInto
insertInto has a small pitfall: when writing into Hive it resolves columns by position, not by name, so the DataFrame's column names are effectively ignored and only the column count has to match (a reordering workaround is sketched after the snippet).
- If the table has two columns x, y and the DataFrame also has two columns, but in the order y, x
- then on insert, the DataFrame's y values land in the table's first column
df.write.mode(SaveMode.Append).insertInto("tmp.test2hive")
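To guard against the positional mismatch, one option is to reorder the DataFrame to the table's column order before calling insertInto. A minimal sketch, assuming the target table already exists:
import org.apache.spark.sql.functions.col
val hiveCols = spark.table("tmp.test2hive").columns // the table's column order, e.g. Array("id", "x", "y")
df.select(hiveCols.map(col): _*).write.mode(SaveMode.Append).insertInto("tmp.test2hive")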
df.write.saveAsTable
This method checks the column names and aligns the DataFrame columns to the Hive table schema by name. Note that the format must be stated explicitly here, otherwise the write may fail with an error.
df.write.format("Hive").mode(SaveMode.Append).saveAsTable("tmp.test2hive")
Partitioned tables
For partitioned tables the main question is dynamic partitioning: if the DataFrame's ds partition column contains several values, how do you get each row into its corresponding partition? The code below shows a static insert and a dynamic one (a SQL-form sketch of the dynamic case follows the snippet).
// Insert into one specific (static) partition
spark.sql("insert into tmp.test2hive_partition partition (ds=20200101) select id,x,y from temp")
// Dynamic partition insert
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df22.write.format("Hive").mode(SaveMode.Overwrite).insertInto("tmp.test2hive_partition")
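The SQL form of a dynamic partition insert works as well: leave the value out of the PARTITION clause and make ds the last column of the SELECT. A minimal sketch (temp_part is a hypothetical temp view registered from df22); it relies on hive.exec.dynamic.partition.mode=nonstrict, which the demo below sets in getOrCreateSpark:
// temp_part is a hypothetical view name used only for this sketch
df22.createOrReplaceTempView("temp_part")
spark.sql("insert overwrite table tmp.test2hive_partition partition (ds) select id, x, y, ds from temp_part")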
Demo
The full code and its log output:
package com.pandas.main
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.SaveMode
/**
* @author pandas
* @date 2020/12/14 16:31
* @project
*/
/**
* Demo of writing DataFrames to Hive with Scala, covering:
* 1. inserts into a regular table (spark.sql, insertInto, saveAsTable)
* 2. inserts into a partitioned table (static and dynamic partitions)
*/
object scalaREWRHive {
def main(args: Array[String]): Unit = {
val spark = getOrCreateSpark("tempApp")
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")
println("----------------------------- 通过spark.sql建表 ------------------------------")
val createSql =
"""CREATE TABLE IF NOT EXISTS tmp.test2hive(
|id STRING COMMENT 'primary key'
|,x INT COMMENT 'X'
|,y INT COMMENT 'Y'
|)
|STORED AS PARQUET""".stripMargin
val createPartitionSql =
"""CREATE TABLE IF NOT EXISTS tmp.test2hive_partition(
|id STRING COMMENT 'primary key'
|,x INT COMMENT 'X'
|,y INT COMMENT 'Y'
|) PARTITIONED BY (ds STRING)
|STORED AS ORC tblproperties ('orc.compress'='SNAPPY')""".stripMargin
readHive(spark, "drop table if exists tmp.test2hive ")
readHive(spark, "drop table if exists tmp.test2hive_partition ")
readHive(spark, createSql)
println("------ create usual table")
readHive(spark, createPartitionSql)
println("------ create partition table")
println("----------------------------- 通过spark.sql插入数据到普通表中 ------------------------------")
val df11 = Seq((1, 10, 11), (2, 20, 21)).toDF("id", "x", "y")
df11.show(truncate = false)
// 1. Register the DataFrame as a temporary view and write it back via spark.sql
df11.createOrReplaceTempView("temp")
spark.sql("insert into tmp.test2hive select * from temp")
// Drop the temporary view
spark.catalog.dropTempView("temp")
println("--- spark.sql:insert table success!")
spark.sql("select * from tmp.test2hive").show(numRows = 20, truncate = true)
// 2. Write with the DataFrameWriter API
val df12 = Seq((3, 30, 31), (4, 40, 41)).toDF("x", "id", "y")
// insertInto writes by position in the DataFrame's current column order; if the names are out of order relative to the table, values can land in the wrong columns
df12.write.mode(SaveMode.Append).insertInto("tmp.test2hive")
val df13 = Seq((5, 50, 51), (6, 60, 61)).toDF("y", "x", "id")
// saveAsTable aligns columns to the table schema by name
df13.write.format("Hive").mode(SaveMode.Append).saveAsTable("tmp.test2hive")
println("append write")
println("--- spark.write:insert table success!")
spark.sql("select * from tmp.test2hive").show(numRows = 20, truncate = true)
println("----------------------------- 通过spark.sql插入数据到分区表中 ------------------------------")
val df21 = Seq((1, 10, 11, 20200101), (2, 20, 21, 20200101)).toDF("id", "x", "y", "ds")
// 1. Register the DataFrame as a temporary view and write it back via spark.sql
df21.createOrReplaceTempView("temp")
spark.sql("insert into tmp.test2hive_partition partition (ds=20200101) select id,x,y from temp")
// 2. Write with the DataFrameWriter API
val df22 = Seq((3, 30, 31, 20200102), (4, 40, 41, 20200103)).toDF("id", "x", "y", "ds")
// Dynamic partition insert
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df22.write.format("Hive").mode(SaveMode.Overwrite).insertInto("tmp.test2hive_partition")
spark.sql("select * from tmp.test2hive_partition").show(truncate = false)
spark.stop()
}
def getOrCreateSpark(appName: String = "tempApp"): SparkSession = {
val builder = SparkSession.builder()
.appName(appName)
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true") // 是否启动hive动态分区
.config("hive.exec.dynamic.partition.mode", "nonstrict") //打开动态分区后,动态分区的模式
.config("spark.executor.heartbeatInterval", "60s") //executor 向 the driver 汇报心跳的时间间隔,单位毫秒
.config("spark.network.timeout", "120s")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") //序列化对象使用的类,建议选用前面的参数
.config("spark.kryoserializer.buffer.max", "512m") //序列化缓存允许的最大值
.config("spark.dynamicAllocation.enabled", "false") //是否开启动态资源搜集
.config("spark.sql.inMemoryColumnarStorage.compressed", "true")
.config("spark.sql.inMemoryColumnarStorage.batchSize", 10000)
.config("spark.sql.broadcastTimeout", 600)
.config("spark.sql.autoBroadcastJoinThreshold", -1)
.config("spark.sql.crossJoin.enabled", true)
.config("spark.sql.files.maxRecordsPerFile", 100000)
.config("spark.sql.parquet.writeLegacyFormat", "true")
val spark = builder.getOrCreate()
spark
}
// small helper, not used in main: prints the sum of the given integers
def plusTest(args: Array[Int]): Unit = {
println(s"plus sum: ${args.sum}")
}
def readHive(spark: SparkSession, sql: String): DataFrame = {
spark.sql(sql)
}
def insertHive(df: DataFrame): Unit = {
}
}
----------------------------- create tables via spark.sql ------------------------------
------ create usual table
------ create partition table
----------------------------- insert data into the regular table via spark.sql ------------------------------
+---+---+---+
|id |x |y |
+---+---+---+
|1 |10 |11 |
|2 |20 |21 |
+---+---+---+
--- spark.sql:insert table success!
+---+---+---+
| id| x| y|
+---+---+---+
| 1| 10| 11|
| 2| 20| 21|
+---+---+---+
append write
--- spark.write:insert table success!
+---+---+---+
| id| x| y|
+---+---+---+
| 3| 30| 31|
| 1| 10| 11|
| 4| 40| 41|
| 2| 20| 21|
| 51| 50| 5|
| 61| 60| 6|
+---+---+---+
----------------------------- insert data into the partitioned table via spark.sql ------------------------------
+---+---+---+--------+
|id |x |y |ds |
+---+---+---+--------+
|1 |10 |11 |20200101|
|2 |20 |21 |20200101|
|3 |30 |31 |20200102|
|4 |40 |41 |20200103|
+---+---+---+--------+
2020-12-23, Jiulonghu, Jiangning District, Nanjing
