I just discovered that when Spark writes to a Hive table with overwrite mode, it deletes the entire table first and then performs the insert, so every existing partition gets removed. What I actually want is to overwrite by partition instead of overwriting the whole table. After some digging, the following method worked in my own tests:
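For reference, this is the kind of write that triggers the full-table wipe. A minimal sketch, assuming a Hive-enabled SparkSession `spark` (built as in the Spark code further down) and a source DataFrame read from a hypothetical staging table; with Spark's default partition-overwrite mode ("static"), an overwrite insert replaces every existing partition of the target table, not only the partitions present in the new data:

// Illustration only: the staging table name and source DataFrame are assumptions.
// With spark.sql.sources.partitionOverwriteMode at its default value ("static"),
// this overwrite removes ALL existing partitions of student_table before
// writing the new rows.
Dataset<Row> df = spark.table("staging_student");
df.write()
        .mode("overwrite")
        .insertInto("student_table");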
Table creation DDL:
CREATE TABLE `student_table` (
  `id` string,
  `name` string
)
PARTITIONED BY (
  `dt` string
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
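Before trying the fix, it can help to seed a couple of partitions so you can see whether an overwrite wipes the whole table or only the targeted partition. A minimal sketch, assuming the table above already exists and reusing the same Hive-enabled SparkSession as in the Spark code below (the sample rows are made up):

// Hypothetical seed data, purely for observing the overwrite behaviour.
spark.sql("INSERT INTO student_table PARTITION (dt='2024-01-01') VALUES ('1', 'alice')");
spark.sql("INSERT INTO student_table PARTITION (dt='2024-01-02') VALUES ('2', 'bob')");

// Check which partitions exist before and after the overwrite run.
spark.sql("SHOW PARTITIONS student_table").show();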
Spark code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class ClearData {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("test")
.co