Dynamically inserting into a Hive partitioned table with Spark SQL

麦田里的虫子

Category: hive

Article link: https://blog.csdn.net/ty555ty/article/details/99602928

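This article describes how to use Spark SQL to dynamically insert data into a Hive partitioned table. It covers creating the Hive partitioned table, setting up the SparkSession, reading data from Oracle, and configuring dynamic partitioning. It also looks at redistributing the data on HDFS with the coalesce and repartition methods and discusses which one fits which number of partitions. Finally, it shows the layout of the Hive partitioned table on HDFS after the insert.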

Prerequisite: create a partitioned table in Hive and specify the partition key
create table test(
    id string
) partitioned by (name string)
stored as orc;
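
Hive stores each partition of a table as a `key=value` subdirectory under the table's directory, so every distinct `name` value written to `test` ends up in its own folder. The sketch below only illustrates that naming convention. The class `PartitionPath` and the warehouse path `/user/hive/warehouse/test` are assumptions made for the example, not part of the original setup, and the escaping Hive applies to special characters in partition values is deliberately ignored.

```java
import java.util.Collections;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: illustrates the key=value directory convention only.
public class PartitionPath {

    /**
     * Builds "<tableDir>/<key>=<value>/..." for a partition spec,
     * following the iteration order of the map.
     * Special-character escaping done by Hive is not reproduced here.
     */
    public static String of(String tableDir, Map<String, String> partitionSpec) {
        StringJoiner path = new StringJoiner("/");
        path.add(tableDir);
        for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
            path.add(e.getKey() + "=" + e.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // Assumed table location; the real one depends on the warehouse configuration.
        String tableDir = "/user/hive/warehouse/test";
        System.out.println(of(tableDir, Collections.singletonMap("name", "a")));
    }
}
```

With the partition key `name` from the table above, a row whose `name` is `a` would land under a directory ending in `name=a`.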

Create the SparkSession. If authentication is not required, remove the config entries.

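        // Requires: import org.apache.spark.sql.SparkSession;
        SparkSession ss = SparkSession.builder()
                .appName("test")
                .master("local[2]")
                // Lets Spark SQL read and write tables in the Hive metastore
                .enableHiveSupport()
                // Authorization settings; remove these two lines if not needed
                .config("spark.sql.authorization.enabled", true)
                .config("hive.security.authorization.enabled", true)
                .getOrCreate();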
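
Rather than deleting the two config lines by hand, the optional settings could be collected in a map and applied only when authorization is needed. This is a minimal sketch of that idea, not the original author's approach. The class `SessionConf` and the method `authSettings` are hypothetical names; the two keys are the ones used in the builder above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: gathers the optional authorization settings in one place.
public class SessionConf {

    /**
     * Returns the extra settings for the session builder.
     * The map is empty when authorization is not required.
     */
    public static Map<String, String> authSettings(boolean authRequired) {
        Map<String, String> conf = new LinkedHashMap<>();
        if (authRequired) {
            conf.put("spark.sql.authorization.enabled", "true");
            conf.put("hive.security.authorization.enabled", "true");
        }
        return conf;
    }

    public static void main(String[] args) {
        // Prints the settings that would be passed on when authorization is enabled.
        authSettings(true).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Each entry can then be passed to the builder with `config(key, value)` before calling `getOrCreate()`.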

Read the data from Oracle

 String sql = "(select xxx.*, 'a' as name from xxx where time >= trunc(sysdate)  ) a";
 Dataset<Row> table = ss.read()
            .format("jdbc")
            .option("driver", "oracle.jdbc.OracleDriver")
            .option("url", "jdbc:oracle:thin:@1.1.1.1:1521:orcl")
            .option("user", &#