About me: I'm Octopus; the name comes from my Chinese name, 章鱼 (octopus). I love programming, algorithms, and open source. All source code is on my personal GitHub. This blog records the bits and pieces of my learning; if you're interested in Python, Java, AI, or algorithms, feel free to follow along so we can learn and improve together.
1. Creating a PySpark DataFrame
from pyspark.sql import SparkSession

# enableHiveSupport() is required for the Hive writes in the later sections
spark = SparkSession.builder \
    .appName("pyspark_write_hive_demo") \
    .enableHiveSupport() \
    .getOrCreate()

employee_salary = [
    ("zhangsan", "IT", 100, "2023-12-01"),
    ("lisi", "IT", 1, "2023-12-01"),
    ("wangwu", "IT", 1, "2023-12-01"),
    ("zhaoliu", "ALGO", 1, "2023-12-01"),
    ("qisan", "IT", 1002, "2023-12-01"),
    ("bajiu", "ALGO", 11, "2023-12-01"),
    ("james", "ALGO", 12, "2023-12-01"),
    ("wangzai", "INCREASE", 12, "2023-12-01"),
    ("carter", "INCREASE", 14, "2023-12-01"),
    ("kobe", "IT", 152, "2023-12-02"),
]
columns = ["name", "department", "salary", "dt"]
df = spark.createDataFrame(data=employee_salary, schema=columns)
df.show()
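Before writing anywhere, it is worth noting that the sample rows above span two values of dt, so any write will create or touch two partitions. A quick pure-Python check of the distinct partition values (independent of Spark, using the same sample data):

```python
# The sample rows from above; the last field is the dt partition value.
employee_salary = [
    ("zhangsan", "IT", 100, "2023-12-01"),
    ("lisi", "IT", 1, "2023-12-01"),
    ("wangwu", "IT", 1, "2023-12-01"),
    ("zhaoliu", "ALGO", 1, "2023-12-01"),
    ("qisan", "IT", 1002, "2023-12-01"),
    ("bajiu", "ALGO", 11, "2023-12-01"),
    ("james", "ALGO", 12, "2023-12-01"),
    ("wangzai", "INCREASE", 12, "2023-12-01"),
    ("carter", "INCREASE", 14, "2023-12-01"),
    ("kobe", "IT", 152, "2023-12-02"),
]

# Distinct dt values = the partitions a write would touch.
partitions = sorted({row[3] for row in employee_salary})
print(partitions)  # ['2023-12-01', '2023-12-02']
```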
2. The Hive table schema
CREATE TABLE IF NOT EXISTS tmp.table_5_15 (
    name string COMMENT 'name',
    department string COMMENT 'department',
    salary int COMMENT 'salary'
)
PARTITIONED BY (dt string COMMENT 'partition field')
STORED AS parquet
TBLPROPERTIES ('parquet.compression'='SNAPPY');
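The table can also be created from PySpark itself by passing a cleaned-up version of this DDL to spark.sql. A minimal sketch (the table and column names follow the example above; running it requires a SparkSession built with enableHiveSupport()):

```python
# Sketch: create the partitioned Hive table from PySpark.
ddl = """
CREATE TABLE IF NOT EXISTS tmp.table_5_15 (
    name string COMMENT 'name',
    department string COMMENT 'department',
    salary int COMMENT 'salary'
)
PARTITIONED BY (dt string COMMENT 'partition field')
STORED AS parquet
TBLPROPERTIES ('parquet.compression'='SNAPPY')
"""
# With a Hive-enabled SparkSession available:
# spark.sql(ddl)
```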
3. Writing to a Hive table in PySpark's DSL style
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("set hive.exec.dynamic.partition=true")
# header/delimiter are CSV options; the hive format ignores them
df.write.format("hive") \
    .mode("overwrite") \
    .partitionBy("dt") \
    .option("header", "false") \
    .option("delimiter", "\t") \
    .saveAsTable("tmp.table_5_15")
PySpark stores directly into Hive here; "dt" is the partition column.
mode can be "overwrite" or "append":
"append" adds rows to the table.
"overwrite" recreates the table before writing, which means it deletes all existing data in the table, not just the data in the partitions being written.
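The difference can be illustrated with a toy in-memory model (plain Python, not Spark code): represent the table as a dict mapping partition value to rows. saveAsTable with "overwrite" replaces the whole dict, while "append" only extends it.

```python
# Toy model of a partitioned table: {partition_value: [rows]}.
table = {
    "2023-12-01": [("old_row_1",)],
    "2023-11-30": [("old_row_2",)],
}

# Incoming write only contains data for one partition.
new_data = {"2023-12-01": [("new_row",)]}

def save_as_table(table, new_data, mode):
    """Mimic saveAsTable semantics in this toy model."""
    if mode == "overwrite":
        # The whole table is rebuilt: every old partition is gone.
        return {p: rows[:] for p, rows in new_data.items()}
    elif mode == "append":
        merged = {p: rows[:] for p, rows in table.items()}
        for p, rows in new_data.items():
            merged.setdefault(p, []).extend(rows)
        return merged
    raise ValueError(mode)

print(sorted(save_as_table(table, new_data, "overwrite")))  # ['2023-12-01']
print(sorted(save_as_table(table, new_data, "append")))     # ['2023-11-30', '2023-12-01']
```

Note how "overwrite" silently drops the untouched 2023-11-30 partition; that is exactly the surprise the next section avoids.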
4. A DSL-style write to Hive that does not delete other partitions
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

configs = [
    ('spark.app.name', 'algo2_spark2_demo'),
    ('spark.driver.memory', '4g'),
    ('spark.executor.memory', '4g'),
    ('spark.executor.instances', '2'),
    ('spark.executor.cores', '2'),
    ('spark.kryoserializer.buffer.max', '128m'),
    ("hive.exec.dynamic.partition.mode", "nonstrict"),
    # only overwrite the partitions that appear in the DataFrame
    ("spark.sql.sources.partitionOverwriteMode", "dynamic"),
]
conf = SparkConf()
conf.setAll(configs)
sc = SparkContext.getOrCreate(conf=conf)
spark = HiveContext(sc)

df.write.insertInto("tmp.table_5_15", overwrite=True)
Inserting this way does not affect other partitions; only the partitions being written are touched. Note that if the newly written partition contains fewer rows than before, it still fully replaces that partition's previous contents.
Before the insert: (table screenshot omitted)
Inserted data: (screenshot omitted)
After the insert: (table screenshot omitted)
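A toy in-memory sketch (plain Python, not Spark) of what dynamic partition overwrite does: only partitions present in the incoming data are replaced; the others survive.

```python
# Toy model: a partitioned table as {partition_value: [rows]}.
table = {
    "2023-12-01": [("old_a",), ("old_b",)],
    "2023-12-02": [("old_c",)],
}

# The incoming write only touches the 2023-12-02 partition.
incoming = {"2023-12-02": [("new_c",)]}

def insert_into_dynamic_overwrite(table, incoming):
    """Mimic insertInto(..., overwrite=True) with
    spark.sql.sources.partitionOverwriteMode=dynamic."""
    result = {p: rows[:] for p, rows in table.items()}
    for p, rows in incoming.items():
        result[p] = rows[:]   # replace only the touched partition
    return result

result = insert_into_dynamic_overwrite(table, incoming)
print(result["2023-12-01"])  # [('old_a',), ('old_b',)] -- untouched
print(result["2023-12-02"])  # [('new_c',)] -- fully replaced
```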
5. Writing via Spark SQL
Create a temp view with createOrReplaceTempView (or createOrReplaceGlobalTempView), then use Spark SQL to insert into the target Hive table. Two conditions must hold:
1) The target Hive table already exists.
2) The column order of the temp view's schema matches the column order of the Hive table's schema.
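Requirement 2 is the one that trips people up: `insert ... select *` matches columns by position, not by name. A small pure-Python sanity check (a sketch; the column lists come from the example table in this post) that the view's columns line up with the table's:

```python
def check_column_order(view_cols, table_cols):
    """True when the view's columns line up positionally with the
    target table's columns (partition column last)."""
    return list(view_cols) == list(table_cols)

# Hive table columns, with the partition column dt last:
table_cols = ["name", "department", "salary", "dt"]
# Columns of the DataFrame / temp view from section 1:
view_cols = ["name", "department", "salary", "dt"]

print(check_column_order(view_cols, table_cols))  # True
# A reordered view would silently write values into the wrong columns:
print(check_column_order(["department", "name", "salary", "dt"], table_cols))  # False
```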
-- Hive dynamic-partition settings
set hive.exec.dynamic.partition=true;           -- enable dynamic partitioning (default: false)
set hive.exec.dynamic.partition.mode=nonstrict; -- allow all partition columns to be dynamic; otherwise at least one static partition is required
-- the insert itself (the partition column is dt, matching the table DDL)
insert overwrite table tmp.table_5_15 partition(dt)
select * from viewTmp;