Spark Structured Streaming: writing a stream to a partitioned external Hive ORC table

Latest recommended article published on 2023-05-05 14:10:25

陈子皮


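Tags: Spark, writing to a specified external table

Copyright notice: this is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_36430300/article/details/113541707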


I am trying to use the Spark Structured Streaming writeStream API to write to a partitioned external Hive table.

CREATE EXTERNAL TABLE `XX`(
  `a` string,
  `b` string,
  `c` string,
  `happened` timestamp,
  `processed` timestamp,
  `d` string,
  `e` string,
  `f` string)
PARTITIONED BY (
  `year` int, `month` int, `day` int)
CLUSTERED BY (d)
INTO 6 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'orc.compression.strategy'='SPEED',
  'orc.create.index'='true',
  'orc.encoding.strategy'='SPEED');

And in the Spark code:

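import org.apache.spark.sql.Row
import org.apache.spark.sql.streaming.DataStreamWriter

// event_stream, _table_loc and _table_checkpoint are defined elsewhere in the job.
val hiveOrcWriter: DataStreamWriter[Row] = event_stream
  .writeStream
  .outputMode("append")
  .format("orc")
  .partitionBy("year", "month", "day")
  //.option("compression", "zlib")
  .option("path", _table_loc)
  .option("checkpointLocation", _table_checkpoint)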

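With a non-partitioned table, I can see the records being inserted into Hive. With the partitioned table, however, the Spark job neither fails nor throws an exception, yet no records show up in the Hive table.

I would appreciate comments from anyone who has dealt with a similar problem.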

Edit:

I just found that the .orc files are indeed written to HDFS, with the correct partition directory structure, e.g. /_table_loc/_table_name/year/month/day/part-0000-0123123.c000.snappy.orc

However,

select * from `XX` limit 1; (or with where year=2018)

returns no rows.

The InputFormat and OutputFormat of table `XX` are org.apache.hadoop.hive.ql.io.orc.OrcInputFormat and org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, respectively.
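One possible explanation, offered as an assumption rather than a confirmed diagnosis: Hive only reads partitions that are registered in the metastore, and a file-based writer that creates directories on HDFS does not register them by itself. If that is what is happening here, the partitions would need to be added explicitly, either with MSCK REPAIR TABLE or with one ALTER TABLE ... ADD PARTITION statement per directory. The sketch below builds such a statement. The object PartitionDdl and its two methods are hypothetical helpers, not part of Spark or Hive, and it assumes the files sit under Hive-style directories such as year=2018/month=5/day=3 below the table's LOCATION.

```scala
// Hypothetical helper (not from the original post, not a Spark or Hive API):
// builds the HiveQL that registers one partition directory with the metastore.
object PartitionDdl {

  // Hive-style partition directory relative to the table location,
  // e.g. year=2018/month=5/day=3
  def partitionPath(year: Int, month: Int, day: Int): String =
    s"year=$year/month=$month/day=$day"

  // ALTER TABLE statement that adds the partition if it is not yet registered.
  // Without a LOCATION clause, Hive resolves the partition to the default
  // directory under the table location, i.e. the path built above.
  def addPartitionSql(table: String, year: Int, month: Int, day: Int): String =
    s"ALTER TABLE `$table` ADD IF NOT EXISTS " +
      s"PARTITION (year=$year, month=$month, day=$day)"
}
```

With a SparkSession that has Hive support enabled, the generated string could be passed to spark.sql(...), or run directly in Hive. Running MSCK REPAIR TABLE on the table is the bulk alternative when many partition directories already exist.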
